WO2016141773A1 - A near-end speech signal detection method and apparatus - Google Patents

A near-end speech signal detection method and apparatus

Info

Publication number
WO2016141773A1
WO2016141773A1 PCT/CN2016/070253 CN2016070253W WO2016141773A1 WO 2016141773 A1 WO2016141773 A1 WO 2016141773A1 CN 2016070253 W CN2016070253 W CN 2016070253W WO 2016141773 A1 WO2016141773 A1 WO 2016141773A1
Authority
WO
WIPO (PCT)
Prior art keywords
input signal
signal
determining
time point
end speech
Prior art date
Application number
PCT/CN2016/070253
Other languages
English (en)
French (fr)
Inventor
梁民
韩波
Original Assignee
电信科学技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 电信科学技术研究院 filed Critical 电信科学技术研究院
Publication of WO2016141773A1 publication Critical patent/WO2016141773A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals

Definitions

  • the present disclosure relates to the field of voice signal detection technologies, and in particular, to a near-end voice signal detection method and apparatus.
  • An acoustic echo canceller (AEC) is an important module of teleconferencing systems, hands-free communication terminals, and similar devices; it cancels the acoustic coupling fed back from the speaker to the microphone, i.e., the acoustic echo between them.
  • In an AEC, the echo path is adaptively modeled by a filter, an effective estimate of the acoustic echo is synthesized from that model, and this estimate is then subtracted from the microphone's received signal, thereby achieving acoustic echo cancellation.
  • When a near-end speech signal appears in the microphone's received signal, a double-talk (DT) situation occurs. Because the near-end speech is statistically uncorrelated with the far-end speech, it acts like burst noise, causing the filter coefficients to diverge from the true values corresponding to the actual echo path; this increases the residual echo and degrades the performance of the acoustic echo canceller. Accurately and promptly detecting whether double talk occurs in the microphone signal is therefore important, and filter-coefficient adaptation must be stopped while double talk is present.
  • A natural remedy is to stop the learning algorithm for the filter coefficient vector whenever double talk occurs and to let it continue when no double talk is present; hence the double-talk detector (DTD) came into being.
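  • To make the interaction between the adaptive filter and the double-talk detector concrete, the sketch below implements a minimal NLMS echo canceller whose coefficient update is frozen whenever a supplied DTD callback flags double talk. It is an illustrative sketch only: the NLMS update rule, the step size `mu`, the regularizer `eps`, and the synthetic echo in the usage lines are assumptions, not details taken from this disclosure.

```python
import numpy as np

def nlms_echo_canceller(x, y, L=128, mu=0.5, eps=1e-6, dtd=None):
    """Cancel the echo of far-end signal x from microphone signal y.

    x, y : 1-D arrays of equal length (far-end and microphone signals).
    L    : length of the modeled echo path.
    dtd  : optional callable dtd(n, e_n, y_n) -> bool; True means double
           talk is detected at sample n, so adaptation is paused.
    Returns the error (echo-cancelled) signal e.
    """
    w = np.zeros(L)                                   # adaptive coefficients
    e = np.zeros(len(y))
    for n in range(len(y)):
        x_vec = x[max(0, n - L + 1):n + 1][::-1]      # newest L samples first
        x_vec = np.pad(x_vec, (0, L - len(x_vec)))    # zero-pad at start-up
        e[n] = y[n] - w @ x_vec                       # residual after echo estimate
        if dtd is None or not dtd(n, e[n], y[n]):     # adapt only without DT
            w += mu * e[n] * x_vec / (x_vec @ x_vec + eps)
    return e

# Toy check: the "echo" is a delayed, attenuated copy of the far-end signal.
rng = np.random.default_rng(0)
x = rng.standard_normal(4000)
y = 0.6 * np.concatenate([np.zeros(10), x[:-10]])
residual = nlms_echo_canceller(x, y)
print("echo power before/after:", np.var(y), np.var(residual[1000:]))
```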
  • At present, the double-talk detector is mainly implemented on the basis of the cross-correlation criterion. Among cross-correlation-based DTDs, two technical schemes are typical:
  • The first scheme uses the cross-correlation between the error signal e(n) of the acoustic echo canceller and the far-end speech signal vector to perform double-talk detection; their cross-correlation coefficient is given by equation (1). Under the assumption that the nonlinear distortion introduced by amplifier overload and the codec is negligible and that the ambient noise is stationary (assumed to hold below unless stated otherwise), equation (1) becomes equation (2), whose terms involve the impulse response of the linear part of the acoustic echo path, the length L of the echo path, the impulse response of the filter, and the autocorrelation matrix of the far-end speech signal.
  • The quantity in equation (2) is highly dependent on changes in the echo path; it is therefore suited to detecting whether the acoustic echo path has changed rather than to detecting whether double talk has occurred.
  • The second scheme uses the cross-correlation between the far-end speech signal vector and the microphone output signal y(n) to construct a decision statistic for double-talk detection; this cross-correlation can be expressed as in equation (3).
  • The variance of the microphone output signal y(n) can be written as in equation (4), in which the powers of the ambient noise and of the near-end speech signal appear. When there is no double talk, i.e., u(n) = 0, equation (4) reduces to equation (5).
  • The decision statistic ξBenesty is defined as equation (5) divided by equation (4), followed by a square root, as in equation (6). From equation (6), ξBenesty equals 1 when no double talk is present and is less than 1 when double talk is present; a threshold parameter TBenesty can therefore be defined, with double talk declared when ξBenesty < TBenesty and no double talk declared otherwise.
  • The cross-correlation between the error signal e(n) and the microphone output signal y(n) can also be used to construct a decision statistic for the DTD. Specifically, this cross-correlation is defined as in equation (7), and the resulting statistic ξIqbal is given by equation (8). When the filter has converged, its impulse response approaches that of the echo path, so ξIqbal ≈ 1 in the absence of double talk and ξIqbal < 1 when double talk is present; a threshold parameter TIqbal can therefore be defined, with double talk declared when ξIqbal < TIqbal and no double talk declared otherwise.
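  • For illustration only, the sketch below computes a simplified cross-correlation-style statistic: the normalized correlation between a microphone frame and the corresponding echo-estimate frame. It is not the ξBenesty or ξIqbal statistic of equations (3)-(8), and the frame length and threshold are assumptions; what it shares with those statistics is that the value stays near 1 when the microphone carries only echo and drops when near-end speech is added, so a fall below a threshold is read as double talk.

```python
import numpy as np

def xcorr_dt_statistic(y_win, y_hat_win, eps=1e-12):
    """Normalized correlation between a microphone frame and the echo
    estimate for that frame: close to 1 for echo-only frames, smaller
    under double talk (a simplified stand-in for a cross-correlation DTD)."""
    num = np.dot(y_win, y_hat_win)
    den = np.sqrt(np.dot(y_win, y_win) * np.dot(y_hat_win, y_hat_win)) + eps
    return num / den

def detect_double_talk(y, y_hat, frame=256, threshold=0.9):
    """Return one boolean per frame: True where double talk is declared."""
    flags = []
    for start in range(0, len(y) - frame + 1, frame):
        xi = xcorr_dt_statistic(y[start:start + frame],
                                y_hat[start:start + frame])
        flags.append(xi < threshold)          # statistic drops => declare DT
    return np.array(flags)
```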
  • The double-talk detection techniques introduced above rest on two assumptions: 1. the nonlinear distortion in the acoustic echo path is small enough to be neglected; 2. the ambient noise is stationary. In practical systems, however, the nonlinear distortion caused by amplifier overload and the codec is not negligible, so the cross-correlation-based double-talk detection techniques of the related art perform poorly. Moreover, noise in real environments is not stationary, and this non-stationarity further worsens the performance of this class of detectors, sometimes to the point that double talk cannot be detected at all.
  • Some embodiments of the present disclosure provide a method and apparatus for detecting a near-end speech signal to improve double talk detection performance.
  • Some embodiments of the present disclosure provide a near-end speech signal detecting method, including:
  • the first input signal is a signal obtained by linearly or non-linearly transforming a far-end signal received by the mobile terminal
  • the second input signal is the near-end signal received by the mobile terminal;
  • optionally, the first input signal is an echo estimation signal output by an adaptive filter of the mobile terminal, the echo estimation signal being obtained by the adaptive filter linearly or nonlinearly filtering the far-end signal.
  • the first input signal is a signal obtained after the far-end signal is linearly delayed.
  • determining, according to the distance, whether a near-end speech signal is present in the second input signal includes: determining whether the distance is less than a first threshold; if so, determining that no near-end speech signal is present in the second input signal, and otherwise determining that a near-end speech signal is present in the second input signal.
  • after it is determined that a near-end speech signal is present in the second input signal, the method further includes: sending indication information to the adaptive filter of the mobile terminal, the indication information instructing the adaptive filter to suspend updating its filter coefficients.
  • Some embodiments of the present disclosure provide a near-end speech signal detecting method, including:
  • first input signal is a far-end signal received by a mobile terminal
  • second input signal is a near-end signal received by the mobile terminal
  • detecting whether the first input signal is greater than a second threshold and whether the second input signal is greater than a third threshold; and, if the first input signal is greater than the second threshold and the second input signal is greater than the third threshold, extracting a first voiceprint feature of the first input signal and a second voiceprint feature of the second input signal, determining a distance between the first voiceprint feature and the second voiceprint feature, and determining, according to the distance, whether a near-end speech signal is present in the second input signal.
  • the detecting whether the first input signal is greater than a second threshold and whether the second input signal is greater than a third threshold includes: detecting whether the first input signal is greater than the second threshold at a first time point, and detecting whether the second input signal is greater than the third threshold at a second time point, where the second time point is a time point delayed relative to the first time point.
  • determining, according to the distance, whether a near-end speech signal is present in the second input signal includes: if the distance is less than a fourth threshold, determining that the second input signal contains no near-end speech signal at the second time point, and otherwise determining that the second input signal contains a near-end speech signal at the second time point.
  • the method further includes: if the second input signal is less than the third threshold, determining that the second input signal contains no near-end speech signal at the second time point; or, if the first input signal is less than the second threshold and the second input signal is greater than the third threshold, determining that the second input signal contains a near-end speech signal at the second time point.
  • after it is determined that a near-end speech signal is present in the second input signal, the method further includes: sending indication information to the adaptive filter of the mobile terminal, the indication information instructing the adaptive filter to suspend updating its filter coefficients.
  • Some embodiments of the present disclosure provide a near-end speech signal detecting apparatus, including:
  • a receiving unit configured to receive a first input signal and a second input signal, where the first input signal is a signal obtained by linearly or non-linearly transforming a far-end signal received by the mobile terminal, the second input The signal is a near-end signal received by the mobile terminal;
  • An extracting unit configured to extract a first voiceprint feature of the first input signal and a second voiceprint feature of the second input signal
  • a determining unit configured to determine a distance between the first voiceprint feature and the second voiceprint feature, and determine, according to the distance, whether a near-end voice signal exists in the second input signal.
  • the first input signal is an echo estimation signal output by an adaptive filter of the mobile terminal, the echo estimation signal being obtained by the adaptive filter linearly or nonlinearly filtering the far-end signal.
  • the first input signal is a signal obtained after the far-end signal is linearly delayed.
  • the determining unit is specifically configured to: determine whether the distance is less than a first threshold; if so, determine that no near-end speech signal is present in the second input signal, and otherwise determine that a near-end speech signal is present in the second input signal.
  • the determining unit is further configured to: send indication information to the adaptive filter of the mobile terminal, the indication information instructing the adaptive filter to suspend updating its filter coefficients.
  • Some embodiments of the present disclosure provide a near-end speech signal detecting apparatus, including:
  • a receiving unit configured to receive a first input signal and a second input signal, where the first input signal is a far-end signal received by the mobile terminal, and the second input signal is a near-end received by the mobile terminal signal;
  • a detecting unit configured to detect whether the first input signal is greater than a second threshold and detect whether the second input signal is greater than a third threshold
  • a determining unit configured to: when it is determined that the first input signal is greater than the second threshold and the second input signal is greater than the third threshold, extract a first voiceprint feature of the first input signal and a second voiceprint feature of the second input signal, determine a distance between the first voiceprint feature and the second voiceprint feature, and determine, according to the distance, whether a near-end speech signal is present in the second input signal.
  • the detecting unit is configured to: detect whether the first input signal is greater than the second threshold at a first time point, and detect whether the second input signal is greater than the third threshold at a second time point, where the second time point is a time point delayed relative to the first time point.
  • the determining unit is specifically configured to: if the distance is less than a fourth threshold, determine that the second input signal contains no near-end speech signal at the second time point, and otherwise determine that the second input signal contains a near-end speech signal at the second time point.
  • the determining unit is further configured to: if the second input signal is less than the third threshold, determine that the second input signal contains no near-end speech signal at the second time point; or, if the first input signal is less than the second threshold and the second input signal is greater than the third threshold, determine that the second input signal contains a near-end speech signal at the second time point.
  • the determining unit is further configured to: send indication information to the adaptive filter of the mobile terminal, the indication information instructing the adaptive filter to suspend updating its filter coefficients.
  • According to the methods and apparatuses above, a first voiceprint feature of the far-end signal and a second voiceprint feature of the near-end signal are extracted, and whether double talk has occurred is determined by comparing the first voiceprint feature with the second voiceprint feature. Because some embodiments of the present disclosure decide whether a near-end speech signal is present, i.e., whether double talk occurs, from the voiceprint features of the far-end and near-end signals, they do not rely on the cross-correlation technique of the prior art or on its two assumptions (1. the nonlinear distortion in the acoustic echo path is small and neglected; 2. the ambient noise is stationary). To a large extent this avoids the misjudgments that arise in the prior art when double talk is detected under those preconditions, and thus achieves more accurate double-talk detection.
  • FIG. 1 is a schematic structural view of an acoustic echo canceler in the prior art
  • FIG. 2 is a schematic flowchart of a method for detecting a near-end speech signal according to some embodiments of the present disclosure
  • FIG. 3 is a schematic diagram of a voiceprint feature extraction process provided by some embodiments of the present disclosure.
  • FIG. 4 is a schematic flowchart of a method for detecting a near-end speech signal according to some embodiments of the present disclosure
  • FIG. 5 is a structural diagram of a near-end speech signal detecting apparatus according to some embodiments of the present disclosure.
  • FIG. 6 is a schematic flowchart of a method for detecting a near-end speech signal according to some embodiments of the present disclosure
  • FIG. 7 is a schematic diagram of an application scenario of a second near-end speech signal detecting apparatus according to some embodiments of the present disclosure.
  • FIG. 8 is a structural diagram of a near-end speech signal detecting apparatus according to some embodiments of the present disclosure.
  • FIG. 9 is a structural diagram of a near-end speech signal detecting apparatus according to some embodiments of the present disclosure.
  • As shown in FIG. 1, a schematic structural diagram of an acoustic echo canceller in the prior art includes a speaker 101, an adaptive filter 102, a double-talk detector 103, and a microphone 104.
  • When the far-end speech signal x(n) is output from the speaker 101, amplifier overload and the codec in the speaker 101 cause nonlinear distortion of x(n); as x(n) travels from the speaker 101 to the microphone 104, the acoustic echo path between the speaker 101 and the microphone 104 also affects x(n).
  • y(n) is the received signal of the microphone 104
  • u(n) is the near-end speech signal
  • v(n) is the system noise
  • x1(n) is the far-end speech signal x(n) after the nonlinear impulse response
  • x2(n) is the echo signal, which is determined by:
  • The echo signal x2(n) fed into the microphone 104 by the speaker 101 is estimated by the adaptive filter 102, yielding an estimated signal formed with the coefficient vector of the adaptive filter 102.
  • The estimated signal is subtracted from the output signal y(n) of the microphone 104 to obtain the corresponding error signal e(n).
  • The coefficient vector of the adaptive filter 102 is obtained through adaptive-algorithm learning; on the condition that it converges to the impulse response of the echo path, the echo signal x2(n) in the error signal e(n) is cancelled, thereby achieving the goal of eliminating the echo signal.
  • When the near-end speech signal u(n) appears, i.e., when double talk occurs, u(n) is statistically uncorrelated with the far-end speech signal x(n); to the far-end speech signal it therefore behaves like a burst interference, causing the adaptive learning algorithm for the coefficient vector of the adaptive filter 102 to diverge and leaving a large residual echo in the error signal e(n).
  • At present this is handled by detecting whether double talk occurs and, when it is detected, stopping the update of the coefficient vector of the adaptive filter 102, so as to avoid a large residual echo appearing in the error signal e(n).
  • Some embodiments of the present disclosure discard these two assumed conditions and implement double-talk detection from another angle.
  • the following describes in detail how the double-talk detection method provided by some embodiments of the present disclosure detects whether double talk occurs. It should be noted that the double talk detection method provided by some embodiments of the present disclosure is not only applied to a teleconferencing system with an acoustic echo canceller, a hands-free communication terminal, etc., but also can be applied to other devices and systems. The application scenario is not limited here.
  • some embodiments of the present disclosure provide a near-end speech signal detecting method, including:
  • Step 201 Receive a first input signal and a second input signal, where the first input signal is a signal obtained by linearly or non-linearly transforming a far-end signal received by the mobile terminal, where The two input signals are near-end signals received by the mobile terminal;
  • Step 202 Extract a first voiceprint feature of the first input signal and a second voiceprint feature of the second input signal;
  • Step 203 Determine a distance between the first voiceprint feature and the second voiceprint feature
  • Step 204 Determine whether a near-end speech signal exists in the second input signal according to the distance.
  • the mobile terminal in some embodiments of the present disclosure may be a device such as a mobile phone, a tablet computer, a conference phone, or the like.
  • the first input signal is a signal obtained by linearly or non-linearly transforming the far-end signal received by the mobile terminal.
  • the far-end signal is a signal that is encoded, modulated, and needs to be played by a device such as a speaker.
  • the second input signal is a signal received by an audio receiving sensor such as a microphone, and may include one of an acoustic echo signal formed by an echo path, an ambient noise signal, and a near-end speech signal.
  • the acoustic echo signal in the second input signal is a signal that needs to be cancelled.
  • When the second input signal contains an acoustic echo signal formed by the far-end signal passing through the echo path, a certain delay is introduced, so the second input signal is not synchronized with the far-end signal. If the far-end signal is not delay-processed and is used directly together with the second input signal for double-talk detection, the accuracy of the detection is reduced. It is therefore necessary to linearly or nonlinearly transform the far-end signal to form a first input signal that is synchronized with the acoustic echo signal in the second input signal.
  • the first input signal may be an echo estimation signal output by the adaptive filter of the mobile terminal, and the echo estimation signal is obtained by linearly or nonlinearly filtering the far-end signal by the adaptive filter;
  • Alternatively, a delay unit may delay the far-end speech signal, and the delayed far-end speech signal is used as the first input signal. The delay applied by the delay unit matches the delay of the echo path; it can be determined by an acoustic echo path delay estimation algorithm or by other methods, and the present disclosure is not limited in this respect.
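  • As a sketch of the delay-unit alternative, the code below estimates the echo-path delay as the lag that maximizes the cross-correlation between the far-end signal and the microphone signal, then delays the far-end signal by that amount to form the first input signal. Using the correlation peak as the estimator and the `max_delay` search range are assumptions; the disclosure only requires that the applied delay match the echo path and leaves the estimation algorithm open.

```python
import numpy as np

def estimate_echo_delay(x, y, max_delay=2000):
    """Return the lag (in samples) at which the far-end signal x best
    aligns with the microphone signal y, searched over 0..max_delay."""
    best_lag, best_corr = 0, -np.inf
    for lag in range(max_delay + 1):
        n = min(len(x), len(y) - lag)
        if n <= 0:
            break
        c = np.dot(x[:n], y[lag:lag + n])
        if c > best_corr:
            best_corr, best_lag = c, lag
    return best_lag

def delayed_first_input(x, delay):
    """Delay the far-end signal so it lines up with its echo in the
    microphone signal; the result serves as the first input signal."""
    return np.concatenate([np.zeros(delay), x])[:len(x)]
```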
  • In step 201, before the first input signal and the second input signal are obtained, it is also possible to detect whether the incoming first input signal and/or second input signal contains a speech signal. When the first input signal is not obtained, or the obtained first input signal contains no speech signal, the adaptive filter in the mobile terminal may stop updating its filter coefficients in order to save power. When the obtained first input signal does contain a speech signal, then if a near-end speech signal is present in the second input signal, the adaptive filter in the mobile terminal may stop updating its filter coefficients; if no near-end speech signal is present in the second input signal, it can be directly determined that no double talk has occurred, and the adaptive filter in the mobile terminal then needs to update its filter coefficients according to the residual signal.
  • There are various ways to detect whether the first and/or second input signal contains a speech signal; for example, voice activity detection (VAD) may be used to check whether an input signal contains speech.
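  • The disclosure leaves the choice of VAD open. The following is a minimal energy-based VAD sketch that could gate the voiceprint extraction described here; the frame length, the decision margin over a tracked noise floor, and the smoothing factor are assumptions rather than parameters from the disclosure.

```python
import numpy as np

def energy_vad(signal, frame=160, margin_db=6.0, floor_alpha=0.95):
    """Flag frames whose short-term energy exceeds a slowly tracked noise
    floor by margin_db. Returns one boolean per frame (True = speech)."""
    flags = []
    noise_floor = None
    for start in range(0, len(signal) - frame + 1, frame):
        e = 10.0 * np.log10(np.mean(signal[start:start + frame] ** 2) + 1e-12)
        if noise_floor is None:
            noise_floor = e                       # initialize on first frame
        flags.append(e > noise_floor + margin_db)
        if not flags[-1]:                         # track floor on non-speech
            noise_floor = floor_alpha * noise_floor + (1 - floor_alpha) * e
    return np.array(flags)
```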
  • step 202 after obtaining the first input signal and the second input signal, the first voiceprint feature of the first input signal and the second voiceprint feature of the second input signal are respectively extracted.
  • A voiceprint is the spectrum of the sound waves that carry speech information. Because the vocal organs people use when speaking differ in size and shape, the voiceprints of any two people differ. On the other hand, the human ear can pick out speech in noisy background noise and under various distortions; this ability stems from the fact that the cochlea is essentially equivalent to a filter bank whose filtering is performed on a logarithmic frequency scale, which makes the human ear more sensitive to low-frequency signals than to high-frequency signals.
  • Taking both the auditory perception of the human ear and the mechanism of human speech production into account, some embodiments of the present disclosure select Mel-frequency cepstral coefficients (MFCC) as the voiceprint characteristic parameters of the speech signal and use them to perform double-talk detection.
  • the basic principle is: firstly extract the MFCC feature parameter vectors of the first input signal and the second input signal, and then calculate the distance between them, and judge whether there is double talk according to the distance.
  • When no double talk occurs, the second input signal contains only the echo signal, so the distance between the MFCC feature parameter vectors of the first input signal and the second input signal is small. When double talk occurs, the second input signal contains not only the near-end speech signal u(n) but possibly also the echo signal (provided a far-end speech signal is present), and the distance between the MFCC feature parameter vectors of the first and second input signals is then large.
  • Because voiceprint characteristic parameters are largely insensitive to nonlinear distortion in the acoustic echo path and to noise interference, the voiceprint-parameter-based DTD proposed by the present disclosure has good robustness against ambient noise and against nonlinear degradation of the acoustic echo path.
  • The voiceprint features extracted from the audio signal include, but are not limited to, MFCC; any characteristic parameter that can effectively characterize and identify the signal and that offers good resistance to noise contamination and nonlinear distortion of the signal may be used.
  • the input signal is pre-emphasized according to a pre-emphasis function to obtain a pre-emphasized input signal; the pre-emphasized input signal is windowed by a window function, and the windowed window is calculated a spectrum of the input signal; filtering a spectrum of the windowed input signal through a Mel filter bank, and performing discrete cosine transform on the filtered spectrum of the windowed input signal to obtain the input signal Voiceprint features.
  • some embodiments of the present disclosure provide a flow chart for extracting voiceprint features.
  • Step 301 pre-emphasis processing
  • The input signal is pre-emphasized by the pre-emphasis function z(n) = x(n) − α·x(n−1), where α is the pre-emphasis coefficient with 0.9 < α < 1.0, typically 0.95;
  • x(n) is an input signal, which may be a first input signal or a second input signal
  • z(n) is a pre-emphasized input signal.
  • Pre-emphasis of the input signal can enhance the high-frequency component of the signal and compensate for the influence of the glottal pulse shape and lip radiation on the speech signal, thereby improving the accuracy of the detection.
  • Step 302 windowing
  • the window signal is used to window the pre-emphasized input signal to obtain the windowed input signal z(n)w(n); where w(n) is a window function of length N, which can be a Hamming window function, Gaussian Window functions, rectangular window functions, etc.
  • Step 303 Calculate the spectrum
  • Step 304 Mel filter bank filtering
  • H m (k) is the frequency response function of the mth filter of the Mel filter bank, which is defined as:
  • f m is the center frequency of the mth Mel filter, which is defined by:
  • f low and f high are the lowest and highest frequencies of the Mel filter bank
  • Fs is the sampling rate
  • M is the number of filter banks
  • Step 305 taking a logarithm
  • Step 306 Discrete cosine transform
  • The MFCC coefficients obtained from the discrete cosine transform form the voiceprint feature vector extracted from the input signal, as in equation (20).
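  • Putting steps 301 to 306 together, the sketch below extracts an MFCC voiceprint vector for one frame: pre-emphasis with α = 0.95, Hamming windowing, an FFT power spectrum, a triangular Mel filter bank spanning f_low to f_high, the logarithm of each filter's output energy, and a discrete cosine transform. The frame length, FFT size, number of filters M, number of retained coefficients, and the use of the magnitude-squared spectrum are assumptions wherever the text above does not pin them down.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(M, nfft, fs, f_low=0.0, f_high=None):
    """M triangular filters H_m(k), with center frequencies equally
    spaced on the Mel scale between f_low and f_high (cf. step 304)."""
    f_high = f_high or fs / 2.0
    mel_pts = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), M + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    H = np.zeros((M, nfft // 2 + 1))
    for m in range(1, M + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            H[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            H[m - 1, k] = (right - k) / max(right - center, 1)
    return H

def mfcc_frame(x_frame, fs, M=24, n_coeff=13, alpha=0.95):
    """Voiceprint (MFCC) vector of one frame, following steps 301-306."""
    z = np.append(x_frame[0], x_frame[1:] - alpha * x_frame[:-1])   # step 301
    z = z * np.hamming(len(z))                                      # step 302
    nfft = len(z)
    spec = np.abs(np.fft.rfft(z, nfft)) ** 2                        # step 303
    E = mel_filterbank(M, nfft, fs) @ spec                          # step 304
    S = np.log(E + 1e-12)                                           # step 305
    m_idx = np.arange(M)
    C = np.array([np.sum(S * np.cos(np.pi * i * (m_idx + 0.5) / M))
                  for i in range(n_coeff)])                         # step 306
    return C

# Example: one 32 ms frame at 16 kHz.
print(mfcc_frame(np.random.default_rng(1).standard_normal(512), fs=16000)[:4])
```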
  • In step 203, the distance D between the first voiceprint feature and the second voiceprint feature is calculated from the feature vectors of equation (20); in the distance expression, ||·|| denotes a vector norm, which may be the 1-norm, 2-norm, or ∞-norm.
  • In step 204, when the distance D between the first voiceprint feature and the second voiceprint feature is greater than or equal to a threshold T (to distinguish it from other thresholds, this threshold may be called the first threshold), it is determined that the second input signal contains a near-end speech signal, i.e., that double talk has occurred; otherwise it is determined that no double talk has occurred, i.e., the system is in a single-talk state, as shown in equation (21).
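  • A minimal rendering of the step 203/204 decision is shown below, assuming the 2-norm for the distance and an illustrative value for the first threshold T; the disclosure allows the 1-, 2-, or ∞-norm and does not fix the value of T.

```python
import numpy as np

def is_double_talk(vp_first, vp_second, T=25.0, norm_order=2):
    """Declare double talk when the distance between the two voiceprint
    feature vectors is greater than or equal to the first threshold T."""
    D = np.linalg.norm(np.asarray(vp_first) - np.asarray(vp_second),
                       ord=norm_order)
    return D >= T            # True: near-end speech present (double talk)
```

  • When this check declares double talk, the indication information described next would be sent so that the adaptive filter pauses its coefficient update.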
  • After double talk is determined to have occurred, indication information is sent to the adaptive filter of the mobile terminal, the indication information being used to instruct the adaptive filter to pause updating the filter coefficients.
  • FIG. 4 and FIG. 5 respectively show schematic diagrams of two specific application scenarios.
  • FIG. 4 shows an embodiment that performs double-talk detection using the microphone output signal y(n) and the output signal of the adaptive filter.
  • the far-end input signal x(n) is filtered by an adaptive filter.
  • y(n) is the microphone output signal.
  • Voiceprint features are extracted separately from the microphone output signal y(n) and from the adaptive filter output signal, and the extracted voiceprint feature vectors are matched: if the voiceprint feature vectors of the two signals are pattern-matched, a single-talk state is declared; otherwise, a double-talk state is declared.
  • the voiceprint feature vector extracted here may be an MFCC type feature parameter, or any other type of feature parameter that can effectively characterize and identify the input signal.
  • the "pattern matching" technique used may be a distance matching technique between feature vectors, or may be other "similarity" matching techniques between feature vectors.
  • Figure 5 shows an embodiment of double talk detection using the microphone output signal y(n) and the far end input signal x(n).
  • As shown in the figure, x(n) is passed through a delay unit before feature extraction, with the length of the delay determined by an acoustic echo path delay estimation algorithm, and feature extraction is also performed on y(n);
  • the extracted voiceprint feature vector is matched. If the voiceprint feature vector of the two signals is pattern matched, it is judged as a single-talk state; otherwise, it is judged as a double-talk state.
  • the voiceprint feature vector extracted here may be an MFCC type feature parameter, or any other type of feature parameter that can effectively characterize and identify the input signal.
  • the "pattern matching" technique used may be a distance matching technique between feature vectors, or may be other "similarity" matching techniques between feature vectors.
  • In the above embodiments, the first voiceprint feature of the first input signal is compared with the second voiceprint feature of the second input signal. When the first voiceprint feature is similar to the second voiceprint feature, it is considered that both the first input signal and the second input signal contain the far-end signal and that the second input signal contains no near-end speech signal, so no double talk has occurred; otherwise, double talk is considered to have occurred.
  • Because a speech signal is non-stationary, it appears as a discontinuous signal in the time domain or the frequency domain. It is therefore not necessary to continuously extract the first voiceprint feature of the first input signal or the second voiceprint feature of the second input signal; one may first detect whether a speech signal is present in the first input signal or the second input signal and, only if a speech signal exists, extract the voiceprint feature of that signal. This is described in detail below by way of specific embodiments.
  • a method for detecting a near-end speech signal includes:
  • Step 601 Receive a first input signal and a second input signal, where the first input signal is a far-end signal received by the mobile terminal, and the second input signal is a near-end signal received by the mobile terminal;
  • Step 602 Detect whether the first input signal is greater than a second threshold, and detect whether the second input signal is greater than a third threshold;
  • Step 603 If the first input signal is greater than the second threshold, and the second input signal is greater than the third threshold, extracting a first voiceprint feature of the first input signal, And extracting a second voiceprint feature of the second input signal, determining a distance between the first voiceprint feature and the second voiceprint feature, and determining whether the second input signal is determined according to the distance There is a near-end speech signal.
  • the mobile terminal in some embodiments of the present disclosure may be a device such as a mobile phone, a tablet computer, a conference phone, or the like.
  • the first input signal received in step 601 is a far end signal.
  • the far-end signal is a signal that is encoded, modulated, and needs to be played by a device such as a speaker.
  • the second input signal that is, the near-end signal
  • the second input signal is a signal received by an audio receiving sensor such as a microphone, and may include one of an acoustic echo signal formed by an echo path, an ambient noise signal, and a near-end speech signal.
  • the acoustic echo signal in the second input signal is a signal that needs to be cancelled.
  • In step 602, it is detected whether the first input signal and the second input signal each contain a signal with the characteristics of a speech signal. There are various detection methods: the detection may be performed with a voice activity detection algorithm or by other methods, and some embodiments of the present disclosure are not limited in this respect.
  • The second threshold may be a preset short-term energy difference based on the ratio of signal energy to noise energy; when the short-term energy difference of the first input signal is detected to be above the second threshold, the first input signal is determined to be a speech signal.
  • Correspondingly, the third threshold may also be a preset short-term energy difference based on the ratio of signal energy to noise energy, used in the same way for the second input signal.
  • In step 602, because the second input signal contains an acoustic echo signal formed by the far-end signal passing through the echo path, a certain delay is introduced and the second input signal is not synchronized with the first input signal: the second input signal lags the first input signal. If the first input signal is not delay-processed and is directly tested against the second threshold, its detection result must be compared with the detection result of the second input signal taken after the delay; if the first input signal is delay-processed, its detection result must be compared with the detection result of the second input signal at the same time point.
  • the length of the delay time can be determined according to the actual situation.
  • The value of the delay length falls into the following two cases:
  • In the first case, the first input signal is not delay-processed; the delay length is then greater than 0, i.e., the second time point is a time point after the first time point, and the specific value of the delay length can be determined from the delay of the far-end signal in the echo path;
  • In the second case, the first input signal is delay-processed; the delay length is then equal to 0, i.e., the second time point coincides with the first time point.
  • In step 603, the detection results for the first input signal and the second input signal fall into the following three cases:
  • First, if the second input signal is less than the third threshold, it is determined that the second input signal contains no near-end speech signal at the second time point;
  • Second, if the first input signal is less than the second threshold and the second input signal is greater than the third threshold, it is determined that the second input signal contains a near-end speech signal at the second time point;
  • Third, if the first input signal is greater than the second threshold and the second input signal is greater than the third threshold, the first voiceprint feature of the first input signal and the second voiceprint feature of the second input signal are extracted, the distance between the first and second voiceprint features is determined, and whether a near-end speech signal is present in the second input signal is determined according to that distance.
  • In the third case, if the distance between the first and second voiceprint features is less than a fourth threshold, it is determined that the second input signal contains no near-end speech signal at the second time point; otherwise, a near-end speech signal is determined to be present at the second time point. The fourth threshold here may be the same as or different from the "first threshold" in the flow shown in FIG. 2.
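  • The three cases of step 603 can be combined as in the sketch below. The energy detectors and the voiceprint extractor are passed in as callables so the sketch stays independent of the particular VAD and feature choices; the names of those callables and the value of the fourth threshold are assumptions.

```python
import numpy as np

def near_end_decision(first_frame, second_frame, exceeds_second_thr,
                      exceeds_third_thr, extract_voiceprint,
                      fourth_threshold=25.0):
    """Decide whether the second input signal contains near-end speech at
    the second time point (cases one to three of step 603). The callables
    return booleans / feature arrays for the frames passed to them."""
    if not exceeds_third_thr(second_frame):
        return False                       # case 1: no speech in mic signal
    if not exceeds_second_thr(first_frame):
        return True                        # case 2: mic active, far end silent
    vp1 = extract_voiceprint(first_frame)  # case 3: both active -> voiceprint
    vp2 = extract_voiceprint(second_frame)
    distance = np.linalg.norm(np.asarray(vp1) - np.asarray(vp2))
    return distance >= fourth_threshold
```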
  • FIG. 7 shows a schematic diagram of two specific application scenarios.
  • Figure 7 shows an embodiment based on VAD and using the microphone output signal y(n) and the far-end input signal x(n) for double-talk detection.
  • VAD monitoring is performed on the far-end input signal x(n). If there is a speech signal, the voiceprint feature vector VPx is extracted for the signal x(n), otherwise, no processing is performed.
  • the VAD monitoring is performed on the microphone output signal y(n) in the downlink, and if there is a speech signal, the voiceprint feature vector VPy is extracted for the signal y(n), otherwise, no processing is performed.
  • Once the voiceprint feature vector VPx is available, the system waits until the voiceprint feature vector VPy becomes available and then immediately performs the pattern-matching processing, as follows:
  • For convenience, denote the value of the downlink VAD at time t as DL_VAD(t) and the value of the uplink VAD at time t as UL_VAD(t). If DL_VAD(t) = 0 and UL_VAD(t) = 1, double talk is declared; if DL_VAD(t) = 0 and UL_VAD(t) = 0, single talk is declared; if DL_VAD(t) = 1 and UL_VAD(t + t0) = 1 (where t0 > 0), voiceprint recognition technology is used to decide whether double talk is present.
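  • The FIG. 7 gating rules read almost directly as the sketch below; here `voiceprint_decision` stands in for the distance-based comparison described earlier, and the value of the frame delay t0 is an assumption.

```python
def fig7_decision(dl_vad, ul_vad, t, t0, voiceprint_decision):
    """Per-frame double-talk decision following the DL_VAD/UL_VAD rules:
    DL=0, UL=1 -> double talk; DL=0, UL=0 -> single talk;
    DL=1 and UL(t+t0)=1 -> fall back to voiceprint matching."""
    if dl_vad[t] == 0 and ul_vad[t] == 1:
        return True                        # double talk
    if dl_vad[t] == 0 and ul_vad[t] == 0:
        return False                       # single talk
    if dl_vad[t] == 1 and t + t0 < len(ul_vad) and ul_vad[t + t0] == 1:
        return voiceprint_decision(t)      # decide by voiceprint recognition
    return False                           # no uplink activity detected
```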
  • the voiceprint feature vector extracted here may be an MFCC type feature parameter, or any other type of feature parameter that can effectively characterize and identify the input signal.
  • the "pattern matching" technique used may be a distance matching technique between feature vectors, or may be other "similarity” matching techniques between feature vectors.
  • some embodiments of the present disclosure further provide a near-end speech signal detecting device, and the specific content of the device may be implemented by referring to the foregoing method, and details are not described herein again.
  • some embodiments of the present disclosure provide a near-end speech signal detecting apparatus, including:
  • the receiving unit 801 is configured to receive the first input signal and the second input signal, where the first input signal is a signal obtained by linearly or non-linearly transforming the far-end signal received by the mobile terminal, the second The input signal is a near-end signal received by the mobile terminal;
  • An extracting unit 802 configured to extract a first voiceprint feature of the first input signal and a second voiceprint feature of the second input signal
  • the determining unit 803 is configured to determine a distance between the first voiceprint feature and the second voiceprint feature, and determine whether a near-end voice signal exists in the second input signal according to the distance.
  • the first input signal is an echo estimation signal output by an adaptive filter of the mobile terminal, the echo estimation signal being obtained by the adaptive filter linearly or nonlinearly filtering the far-end signal.
  • the first input signal is a signal obtained after the far-end signal is linearly delayed.
  • the determining unit 803 is specifically configured to: determine whether the distance is less than a first threshold; if so, determine that no near-end speech signal is present in the second input signal, and otherwise determine that a near-end speech signal is present in the second input signal.
  • the determining unit 803 is further configured to: send indication information to the adaptive filter of the mobile terminal, the indication information instructing the adaptive filter to suspend updating its filter coefficients.
  • some embodiments of the present disclosure provide a near-end speech signal detecting apparatus, including:
  • the receiving unit 901 is configured to receive the first input signal and the second input signal, where the first input signal is a far-end signal received by the mobile terminal, and the second input signal is a near-end signal received by the mobile terminal;
  • the detecting unit 902 is configured to detect whether the first input signal is greater than a second threshold and to detect whether the second input signal is greater than a third threshold;
  • the determining unit 903 is configured to: when it is determined that the first input signal is greater than the second threshold and the second input signal is greater than the third threshold, extract a first voiceprint feature of the first input signal and a second voiceprint feature of the second input signal, determine a distance between the first voiceprint feature and the second voiceprint feature, and determine, according to the distance, whether a near-end speech signal is present in the second input signal.
  • the detecting unit 902 is configured to: detect whether the first input signal is greater than the second threshold at a first time point, and detect whether the second input signal is greater than the third threshold at a second time point, where the second time point is a time point delayed relative to the first time point.
  • the determining unit 903 is specifically configured to:
  • if the distance is less than the fourth threshold, determine that the second input signal contains no near-end speech signal at the second time point, and otherwise determine that the second input signal contains a near-end speech signal at the second time point.
  • the determining unit 903 is further configured to:
  • if the second input signal is less than the third threshold, determine that the second input signal contains no near-end speech signal at the second time point; or, if the first input signal is less than the second threshold and the second input signal is greater than the third threshold, determine that the second input signal contains a near-end speech signal at the second time point.
  • the determining unit 903 is further configured to: send indication information to the adaptive filter of the mobile terminal, the indication information instructing the adaptive filter to suspend updating its filter coefficients.
  • In summary, according to the methods and apparatuses provided by some embodiments of the present disclosure, the first voiceprint feature of the far-end speech signal and the second voiceprint feature of the output signal of the audio receiving device are extracted, and whether double talk has occurred is determined by comparing the first voiceprint feature with the second voiceprint feature. This avoids the misjudgments that arise in the prior art, which detects double talk on the preconditions that the nonlinear distortion in the acoustic echo path is negligibly small and that the ambient noise is stationary, and therefore achieves more accurate double-talk detection.
  • the present disclosure can be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or a combination of software and hardware aspects. Moreover, the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage and optical storage, etc.) including computer usable program code.
  • The computer program instructions can also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture that includes an instruction apparatus which implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephone Function (AREA)
  • Filters That Use Time-Delay Elements (AREA)

Abstract

A near-end speech signal detection method and apparatus. The method includes: receiving a first input signal and a second input signal (201), where the first input signal is obtained by linearly or nonlinearly transforming a far-end speech signal, and the second input signal is the output signal of an audio receiving sensor; determining a first voiceprint feature vector of the first input signal from the first input signal, and determining a second voiceprint feature vector of the second input signal from the second input signal (202); determining a distance between the first voiceprint feature and the second voiceprint feature (203), and determining, according to the distance, whether double talk has occurred.

Description

一种近端语音信号检测方法及装置
相关申请的交叉引用
本申请主张在2015年3月9日在中国提交的中国专利申请号No.201510102968.X的优先权,其全部内容通过引用包含于此。
技术领域
本公开涉及语音信号检测技术领域,尤其涉及一种近端语音信号检测方法及装置。
背景技术
声学回波抵消器(Acoustic Echo canceller,AEC)是电话会议系统、免提通信终端等设备的一个重要模块,用来抵消由扬声器到麦克风的声学耦合反馈效应,即扬声器到麦克风之间的声学回波。
在声学回波抵消器中,用一个滤波器对回波路径进行自适应地数学建模,并由此合成一个声学回波的有效估计,然后在麦克风的接收信号的输出信号中减去该声学回波的有效估计,从而实现声学回波抵消的目的。当麦克风的接收信号中出现近端语音信号时,即发生双讲(Double-Talk,DT)情形,由于它与远端语音信号统计上不相关,因而其犹如一个突发的噪声,使得滤波器的系数将偏离实际声学回波路径所对应的真值而发生发散现象。这便相应地增大了回波残留量,使声学回波抵消器的性能恶化。为使声学回波抵消器的工作性能稳定可靠,准确而及时地检测出麦克风接收信号中是否发生双讲,是一项非常重要和必要的任务。在DT发生的条件下,滤波器系数的自适应学习必须停止进行,以避免在该情况下滤波器系数持续学习所致的发散现象。
为克服这一问题,一种自然的处理方法是:滤波器的滤波器系数矢量的学习算法应该在发生双讲的情况下被停止执行,而在未发生双讲时将持续进行。由此,双讲检测器(DTD)便应运而生。目前,双讲检测器主要是基于互相关 (Cross-Correlation)准则实现的。在基于互相关准则的DTD中,较典型的技术方案有以下两种:
第一种方案,利用声学回波抵消器中的误差信号e(n)和远端语音信号矢量
Figure PCTCN2016070253-appb-000001
之间互相关来进行双讲检测,误差信号e(n)和远端语音信号矢量
Figure PCTCN2016070253-appb-000002
之间互相关系数如下:
Figure PCTCN2016070253-appb-000003
在由放大器过载和编码解码器引入的非线性失真可以忽略不计,以及环境噪声是平稳的假设条件下(在无特别注明的情况下,以下均假设该条件成立),式(1)变为:
Figure PCTCN2016070253-appb-000004
其中,
Figure PCTCN2016070253-appb-000005
为声学回波的回波路径中线性部分的冲击响应,L为回波路径的长度;
Figure PCTCN2016070253-appb-000006
为滤波器的冲击响应;
Figure PCTCN2016070253-appb-000007
为远端语音信号的自相关矩阵。
式(2)中的
Figure PCTCN2016070253-appb-000008
高度依赖于回波路径的变化,因而适合用于检测声学回波路径是否发生变化,而不是用来检测双讲是否发生。
第二种方案,利用远端语音信号矢量
Figure PCTCN2016070253-appb-000009
和麦克风输出信号y(n)之间的互相关来构造一个决策统计量用于双讲检测。矢量
Figure PCTCN2016070253-appb-000010
和y(n)之间的互相关
Figure PCTCN2016070253-appb-000011
可表达为:
Figure PCTCN2016070253-appb-000012
考虑到麦克风输出信号y(n)的方差
Figure PCTCN2016070253-appb-000013
可表示成下式:
Figure PCTCN2016070253-appb-000014
其中
Figure PCTCN2016070253-appb-000015
Figure PCTCN2016070253-appb-000016
分别为环境噪声和近端语音信号的功率。
在无DT,即u(n)=0时,式(4)即为:
Figure PCTCN2016070253-appb-000017
将决策统计量ξBenesty定义为用式(5)除以式(4)后再开方,即:
Figure PCTCN2016070253-appb-000018
根据式(6)可以确定,在无双讲时,决策统计量ξBenesty取值为1;在有双讲时,决策统计量ξBenesty取值小于1。因此可定义一个门限值参数TBenesty,当ξBenesty<TBenesty,则确定发生双讲;否则,确定无双讲发生。
还可以利用误差信号e(n)和麦克风输出信号y(n)之间的互相关来构造DTD的决策统计量,具体的,将误差信号e(n)和麦克风输出信号y(n)之间的互相关定义为:
Figure PCTCN2016070253-appb-000019
构造的决策统计量ξIqbal如下:
Figure PCTCN2016070253-appb-000020
在滤波器收敛时,滤波器的冲击响应趋于回波路径的冲击响应,即
Figure PCTCN2016070253-appb-000021
那么在无双讲的情况下ξIqbal≈1,而在有双讲时ξIqbal<1。因此可定义一个门限值参数TIqbal,当ξIqbal<TIqbal,则确定发生双讲;否则,就确定未发生双讲。
上述介绍的双讲检测技术都是基于以下两个假设:1、声学回波路径中非线性失真很小而忽略不计;2、环境噪声是平稳的。然而实际系统中,由于放大器过载和编码解码器所引发的非线性失真不可忽略,使得相关技术中的基于互相关技术的双讲检测技术的性能较差。此外,实际环境中的噪声也并非是平稳的,这一非平稳性也将进一步加剧该类双讲检测技术性能的恶化程度,乃至有时无法正常检测出是否发生双讲。
发明内容
本公开的一些实施例提供了一种近端语音信号检测方法及装置,用以提高双讲检测性能。
本公开的一些实施例提供了一种近端语音信号检测方法,包括:
接收第一输入信号以及第二输入信号,其中,所述第一输入信号为移动终端接收到的远端信号被线性或非线性变换后所得到的信号,所述第二输入信号为所述移动终端接收到的近端信号;
提取所述第一输入信号的第一声纹特征以及所述第二输入信号的第二声纹特征;
确定所述第一声纹特征与所述第二声纹特征之间的距离;以及
根据所述距离确定所述第二输入信号中是否存在近端语音信号。
可选地,所述第一输入信号为所述移动终端的自适应滤波器输出的回波估计信号,所述回波估计信号是所述自适应滤波器对所述远端信号进行线性或非线性滤波得到的。
可选地,所述第一输入信号为所述远端信号被线性延时后得到的信号。
可选地,所述根据所述距离确定所述第二输入信号中是否存在近端语音信号,包括:
判断所述距离是否小于第一门限值,若是,则确定所述第二输入信号中不存在近端语音信号,否则,确定所述第二输入信号中存在近端语音信号。
可选地,所述确定所述第二输入信号中存在近端语音信号之后,还包括:
向所述移动终端的自适应滤波器发送指示信息,其中,所述指示信息用于指示所述自适应滤波器暂停更新滤波器系数。
本公开的一些实施例提供了一种近端语音信号检测方法,包括:
接收第一输入信号以及第二输入信号,其中,所述第一输入信号为移动终端接收到的远端信号,所述第二输入信号为所述移动终端接收到的近端信号;以及
检测所述第一输入信号是否大于第二门限值,以及检测所述第二输入信号是否大于第三门限值;
若所述第一输入信号大于所述第二门限值,且所述第二输入信号大于所述第三门限值,则提取所述第一输入信号的第一声纹特征,以及提取所述第二输入信号的第二声纹特征,确定所述第一声纹特征与所述第二声纹特征之间的距 离,并根据所述距离确定所述第二输入信号中是否存在近端语音信号。
可选地,所述检测所述第一输入信号是否大于第二门限值,以及检测所述第二输入信号是否大于第三门限值,包括:
检测所述第一输入信号在第一时间点是否大于所述第二门限值,以及检测所述第二输入信号在第二时间点是否大于所述第三门限值,其中,所述第二时间点为所述第一时间点经过延时后的时间点。
可选地,所述根据所述距离确定所述第二输入信号中是否存在近端语音信号,包括:
若所述距离小于第四门限值,则确定所述第二输入信号在所述第二时间点不存在近端语音信号,否则,确定所述第二输入信号在所述第二时间点存在近端语音信号。
可选地,还包括:
若所述第二输入信号小于所述第三门限值,则确定所述第二输入信号在所述第二时间点不存在近端语音信号;或者,
若所述第一输入信号小于所述第二门限值,且所述第二输入信号大于所述第三门限值,则确定所述第二输入信号在所述第二时间点存在近端语音信号。
可选地,确定所述第二输入信号中存在近端语音信号之后,还包括:
向所述移动终端的自适应滤波器发送指示信息,其中,所述指示信息用于指示所述自适应滤波器暂停更新滤波器系数。
本公开的一些实施例提供了一种近端语音信号检测装置,包括:
接收单元,用于接收第一输入信号以及第二输入信号,其中,所述第一输入信号为移动终端接收到的远端信号被线性或非线性变换后所得到的信号,所述第二输入信号为所述移动终端接收到的近端信号;
提取单元,用于提取所述第一输入信号的第一声纹特征以及所述第二输入信号的第二声纹特征;以及
确定单元,用于确定所述第一声纹特征与所述第二声纹特征之间的距离,并根据所述距离确定所述第二输入信号中是否存在近端语音信号。
可选地,所述第一输入信号为所述移动终端的自适应滤波器输出的回波估计信号,所述回波估计信号是所述自适应滤波器对所述远端信号进行线性或非线性滤波得到的。
可选地,所述第一输入信号为所述远端信号被线性延时后得到的信号。
可选地,所述确定单元具体用于:
判断所述距离是否小于第一门限值,若是,则确定所述第二输入信号中不存在近端语音信号,否则,确定所述第二输入信号中存在近端语音信号。
可选地,所述确定单元还用于:
向所述移动终端的自适应滤波器发送指示信息,其中,所述指示信息用于指示所述自适应滤波器暂停更新滤波器系数。
本公开的一些实施例提供了一种近端语音信号检测装置,包括:
接收单元,用于接收第一输入信号以及第二输入信号,其中,所述第一输入信号为移动终端接收到的远端信号,所述第二输入信号为所述移动终端接收到的近端信号;
检测单元,用于检测所述第一输入信号是否大于第二门限值以及检测所述第二输入信号是否大于第三门限值;以及
确定单元,用于在确定所述第一输入信号大于所述第二门限值且所述第二输入信号大于所述第三门限值时,提取所述第一输入信号的第一声纹特征,以及提取所述第二输入信号的第二声纹特征,确定所述第一声纹特征与所述第二声纹特征之间的距离,并根据所述距离确定所述第二输入信号中是否存在近端语音信号。
可选地,所述检测单元用于:
检测所述第一输入信号在第一时间点是否大于所述第二门限值,以及检测所述第二输入信号在第二时间点是否大于所述第三门限值,其中,所述第二时间点为所述第一时间点经过延时后的时间点。
可选地,所述确定单元具体用于:
若所述距离小于第四门限值,则确定所述第二输入信号在所述第二时间点 不存在近端语音信号,否则,确定所述第二输入信号在所述第二时间点存在近端语音信号。
可选地,所述确定单元还用于:
若所述第二输入信号小于所述第三门限值,则确定所述第二输入信号在所述第二时间点不存在近端语音信号;或者,
若所述第一输入信号小于所述第二门限值,且所述第二输入信号大于所述第三门限值,则确定所述第二输入信号在所述第二时间点存在近端语音信号。
可选地,所述确定单元还用于:
向所述移动终端的自适应滤波器发送指示信息,其中,所述指示信息用于指示所述自适应滤波器暂停更新滤波器系数。
根据本公开的一些实施例提供的方法及装置,提取远端信号的第一声纹特征,以及近端信号中的第二声纹特征之后,通过对比第一声纹特征与第二声纹特征确定是否发生双讲。由于本公开的一些实施例是根据远端信号和近端信号的声纹特征来进行近端语音信号的判决,即判断是否发生双讲,没有像现有技术一样基于互相关技术以及两个假设(1、声学回波路径中非线性失真很小而忽略不计;2、环境噪声是平稳的)来进行双讲检测,因此一定程度上避免了现有技术中以假设声学回波路径中非线性失真很小而忽略不计以及假设环境噪声平稳为前提条件去检测是否发生双讲时产生的误判等情况,从而更准确的实现双讲检测。
附图说明
图1为现有技术中声学回波抵消器的结构示意图;
图2为本公开的一些实施例提供的一种近端语音信号检测方法流程示意图;
图3为本公开的一些实施例提供的声纹特征提取流程示意图;
图4为本公开的一些实施例提供的一种近端语音信号检测方法流程示意图;
图5为本公开的一些实施例提供的一种近端语音信号检测装置结构图;
图6为本公开的一些实施例提供的一种近端语音信号检测方法流程示意图;
图7为本公开的一些实施例提供的第二种近端语音信号检测装置应用场景示意图;
图8为本公开的一些实施例提供的一种近端语音信号检测装置结构图;
图9为本公开的一些实施例提供的一种近端语音信号检测装置结构图。
具体实施方式
如图1所示,为现有技术中声学回波抵消器的结构示意图,包括扬声器101,自适应滤波器102,双讲检测器103,麦克风104。远端语音信号x(n)从扬声器101输出时,扬声器101中放大器过载和编码解码器会导致远端语音信号x(n)非线性失真;远端语音信号x(n)从扬声器101传输到麦克风104的过程中,扬声器101传输到麦克风104之间的声学回波路径也会对远端语音信号x(n)产生影响。
现假设导致远端语音信号x(n)非线性失真的非线性冲击响应很小,可以忽略不计,那么有:
Figure PCTCN2016070253-appb-000022
其中,y(n)为麦克风104的接收信号,u(n)为近端语音信号,v(n)为系统噪声,x1(n)为远端语音信号x(n)经过非线性冲击响应后的语音信号,它们均为零均值;x2(n)为回波信号,由下式确定:
Figure PCTCN2016070253-appb-000023
其中,
Figure PCTCN2016070253-appb-000024
为扬声器101到麦克风104之间的声学回波路径中线性部分的冲击响应,L为回波路径的长度;
Figure PCTCN2016070253-appb-000025
这时用自适应滤波器102对扬声器101馈入麦克风104的回波信号x2(n)进行估计,得估计信号
Figure PCTCN2016070253-appb-000026
如下:
Figure PCTCN2016070253-appb-000027
其中,
Figure PCTCN2016070253-appb-000028
为自适应滤波器102的系数矢量。
Figure PCTCN2016070253-appb-000029
从麦克风104的输出信号y(n)中减去,获得相应的误差信号e(n)为:
Figure PCTCN2016070253-appb-000030
自适应滤波器102的系数矢量
Figure PCTCN2016070253-appb-000031
是通过自适应算法学习获得的,在
Figure PCTCN2016070253-appb-000032
收敛于
Figure PCTCN2016070253-appb-000033
的条件下,误差信号e(n)中的回波信号x2(n)会被抵消,从而达到消除回波信号的目的。当近端语音信号u(n)出现,即发生双讲时,由于近端语音信号u(n)与远端语音信号x(n)之间统计上不相关,因此近端语音信号u(n)对于远端语音信号x(n)来说犹如一个突发干扰信号,致使自适应滤波器102的系数矢量
Figure PCTCN2016070253-appb-000034
的自适应学习算法发散,由此导致误差信号e(n)中将出现较大的残留回波。
目前通过检测双讲是否发生,并在检测到双讲发生时停止自适应滤波器102的系数矢量
Figure PCTCN2016070253-appb-000035
的更新,从而避免导致误差信号e(n)中将出现较大的残留回波。
现有技术中,在检测双讲是否发生时,都是基于以下两个假设:1、声学回波路径中非线性失真很小而忽略不计;2、环境噪声是平稳的。然而,实际情况中,声学回波路径中非线性失真往往很大,或者环境噪声非常不平稳,导致基于这两个假设条件的双讲检测技术的性能很不稳定,有时无法正常检测出是否发生双讲。
本公开的一些实施例中将摒弃这两个假设条件,从另外一个角度去实现双讲检测,下面详细描述本公开的一些实施例提供的双讲检测方法是如何检测双讲是否发生。需要说明的是,本公开的一些实施例提供的双讲检测方法并不仅仅是应用于带有声学回波抵消器的电话会议系统、免提通信终端等设备,还可以应用于其他设备和系统,在此并不限定其应用场景。
如图2所示,本公开的一些实施例提供的一种近端语音信号检测方法,该方法包括:
步骤201:接收第一输入信号以及第二输入信号,其中,所述第一输入信号为移动终端接收到的远端信号被线性或非线性变换后所得到的信号,所述第 二输入信号为所述移动终端接收到的近端信号;
步骤202:提取所述第一输入信号的第一声纹特征以及所述第二输入信号的第二声纹特征;
步骤203:确定所述第一声纹特征与所述第二声纹特征之间的距离;以及
步骤204:根据所述距离确定所述第二输入信号中是否存在近端语音信号。
本公开的一些实施例中的移动终端可以为手机、平板电脑、会议电话等设备。
在步骤201中,第一输入信号为移动终端接收到的远端信号被线性或非线性变换后所得到的信号。远端信号是经过编码、调制,并需要被扬声器等设备播放的信号。
第二输入信号,即近端信号,是由麦克风等音频接收传感器接收到的信号,可能包括远端信号经过回声路径形成的声学回波信号、环境噪声信号以及近端语音信号中的一种或多种组合,第二输入信号中的声学回波信号是需要消除的信号。第二输入信号中包含由远端信号经过回声路径形成的声学回波信号时,会产生一定的延时,导致与远端信号之间不同步,如果不对远端信号进行延时处理,直接采用远端信号与第二输入信号进行双讲检测,会降低检测的准确性。因此需要将远端信号进行线性变换或非线性变换,形成与第二输入信号中声学回波信号同步的第一输入信号。
实现将远端信号进行线性变换或非线性变换形成第一输入信号的方法有多种。第一输入信号可以为移动终端的自适应滤波器输出的回波估计信号,所述回波估计信号是所述自适应滤波器对所述远端信号进行线性或非线性滤波得到的;也可以通过延时单元对远端语音信号延时,将延时后的远端语音信号作为第一输入信号。需要说明的是,该延时单元对信号的延时与回波路径的延时相匹配,可以通过声学回声路径延时估计算法确定出延时单元,也可以通过其他方法确定出延时单元,本公开对此并不限定。
在步骤201中,获得第一输入信号以及第二输入信号之前,还可以检测输入的第一输入信号和\或第二输入信号中是否有语音信号,在未获得第一输入 信号或者获得的第一输入信号中不包含语音信号时,移动终端中的自适应滤波器的滤波器系数可以停止系数的更新,以便节省功耗;获得的第一输入信号中包含语音信号时,若第二输入信号中存在近端语音信号时,移动终端中的自适应滤波器的滤波器系数可以停止系数的更新,若第二输入信号中不存在近端语音信号时,可以直接确定未发生双讲,此时移动终端中的自适应滤波器需要根据残差信号进行滤波器系数的更新。
检测输入的第一输入信号和\或第二输入信号中是否有语音信号的方法有多种,例如可以通过语音活动检测(Voice activity detection,VAD)来检测输入的信号是否包含语音信号。
步骤202中,在获得第一输入信号以及第二输入信号之后,分别提取第一输入信号的第一声纹特征以及第二输入信号第二声纹特征。
声纹(Voiceprint)是携带语音信息的声波频谱,由于人在讲话时使用的发生器在尺寸和形态方面各自有差异,所以任何两个人的声纹都存在差异;另一方面,人耳能在吵杂的背景噪声中及各种变异的情况下听到语音信号,该特性是得益于这样一个事实:耳蜗实质上相当于一个滤波器组,其滤波作用是在对数频率尺度上进行的,从而使得人耳对低频信号比对高频信号更敏感。综合考虑人耳的听觉感知和人的语音产生的机理,在本公开的一些实施例中选择Mel频率的倒谱系数(Mel-Frequency Cepstral Coefficient,MFCC)作为语音信号的声纹特征参数,用来进行双讲检测。其基本原理是:首先分别提取第一输入信号和第二输入信号的MFCC特征参数矢量,然后计算它们之间的距离,根据距离判断有无发生双讲。在未发生双讲的情况下,第二输入信号中仅含回波信号,因而第一输入信号和第二输入信号的MFCC特征参数矢量间的距离较小;在发生双讲的情况下,第二输入信号中不仅含近端语音信号u(n),而且还可能包含回波信号(在有远端语音信号的前提下),此时第一输入信号和第二输入信号的MFCC特征参数矢量间的距离较大。由于声纹特征参数对声学回波路径中的非线性失真和噪声干扰具有较强的不敏感特性,因而本公开提出的基于声纹特征参数之DTD对环境噪声和声学回波路径中的非线性退变,具有较好 的鲁棒性。
需要说明的是,本公开的一些实施例中从音频信号中提取的声纹特征包括但不限于MFCC,可以是能有效表征和鉴别信号的任何特征参数,并且该类参数对信号的噪声污染和非线性畸变具有较好的抵免性。
针对一个输入信号,根据预加重函数对所述输入信号进行预加重,获得预加重后的输入信号;通过窗函数对所述预加重后的输入信号进行加窗,并计算所述加窗后的输入信号的频谱;通过Mel滤波器组对所述加窗后的输入信号的频谱进行滤波,并对滤波后的所述加窗后的输入信号的频谱进行离散余弦变换,获得所述输入信号的声纹特征。
具体地,如图3所示,本公开的一些实施例提供的提取声纹特征流程图。
步骤301:预加重处理;
将输入信号通过预加重函数进行预加重处理,预加重函数为:
z(n)=x(n)-α·x(n-1)     (13)
其中,0.9<α<1.0为预加重系数,α一般取0.95,x(n)为输入信号,可以为第一输入信号或者第二输入信号,z(n)为预加重后的输入信号。对输入信号进行预加重可以提升信号的高频分量进而补偿声门脉冲形状和口唇辐射对语音信号产生的影响,从而提高检测的准确性。
步骤302:加窗;
通过窗函数对预加重后的输入信号进行加窗,获得加窗后的输入信号z(n)w(n);其中w(n)为长度N的窗函数,可以为汉明窗函数、高斯窗函数、矩形窗函数等。
步骤303:计算频谱;
对加窗后的输入信号进行离散傅立叶变换,获得第t帧输入信号的频谱Z(t,k):
Figure PCTCN2016070253-appb-000036
步骤304:Mel滤波器组滤波;
采用M组Mel滤波器{Hm(k),m=0,1,2,…,M-1}对Z(t,k)进行处理,每个Mel滤波器的输出能量E(t,m)为:
Figure PCTCN2016070253-appb-000037
这里Hm(k)为Mel滤波器组第m个滤波器的频响函数,它定义为:
Figure PCTCN2016070253-appb-000038
其中fm为第m个Mel滤波器的中心频率,它由下式定义:
Figure PCTCN2016070253-appb-000039
式(17)中flow和fhigh分别为Mel滤波器组的最低和最高频率,Fs为采样率,M为滤波器组的数目,函数
Figure PCTCN2016070253-appb-000040
步骤305:取对数;
首先对式(17)取对数,获得Mel滤波器组中每个滤波器输出的对数能量S(t,m):
S(t,m)=logeE(t,m),m=0,1,…,M-1     (18)
步骤306:离散余弦变换:
然后对(18)式经离散余弦变换(DCT)变换得MFCC的系数如下:
Figure PCTCN2016070253-appb-000041
由此提取到输入信号的的声纹特征矢量
Figure PCTCN2016070253-appb-000042
为:
Figure PCTCN2016070253-appb-000043
根据上述提取信号声纹特征的流程,可以提取第一输入信号的第一声纹特征
Figure PCTCN2016070253-appb-000044
以及第二输入信号第二声纹特征
Figure PCTCN2016070253-appb-000045
在步骤203中,根据式(20)计算第一声纹特征
Figure PCTCN2016070253-appb-000046
与第二声纹特征
Figure PCTCN2016070253-appb-000047
之间的距离D:
Figure PCTCN2016070253-appb-000048
其中,||·||为矢量的范数,可为1-范数、2-范数或者∞-范数。
最后,在步骤204中,当第一声纹特征
Figure PCTCN2016070253-appb-000049
与第二声纹特征
Figure PCTCN2016070253-appb-000050
之间的距离D大于或等于门限值T时(为了与其他门限值相区别,此处可称该门限值为第一门限值),确定第二输入信号中包含近端语音信号,即发生双讲,否则确定未发生双讲,即处于单讲状态,具体如式(21)所示:
Figure PCTCN2016070253-appb-000051
在确定发生双讲之后,向移动终端的自适应滤波器发送指示信息,所述指示信息用于指示所述自适应滤波器暂停更新滤波器系数。
根据以上图2所示流程的描述,图4和图5分别示出了两种具体应用场景的示意图。
图4示出了利用麦克风输出信号y(n)和自适应滤波器输出信号
Figure PCTCN2016070253-appb-000052
来进行双讲检测的实施例。如图4所示,远端输入信号x(n)经过自适应滤波器滤波后形成
Figure PCTCN2016070253-appb-000053
y(n)为麦克风输出信号。对麦克风输出信号y(n)和
Figure PCTCN2016070253-appb-000054
分别进行声纹特征提取,对所提取的声纹特征矢量进行匹配处理,若两路信号的声纹特征矢量是模式匹配的,则判为单讲状态;否则,判为双讲状态。这里所提取的声纹特征矢量可以是MFCC型特征参数,也可以是能有效表征和鉴别输入信号的任何其它类型的特征参数。所采用的“模式匹配”技术可以是特征矢量间的距离匹配技术,也可以是特征矢量间的其它“相似度”匹配技术。
图5给出了利用麦克风输出信号y(n)和远端输入信号x(n)来进行双讲检测的实施例。如图所示,对x(n)通过延时单元进行延时处理后进行特征提取,所延时的长度由声学回声路径延时估计算法决定,并对y(n)进行特征提取;然后,对所提取的声纹特征矢量进行匹配处理,若两路信号的声纹特征矢量是模式匹配的,则判为单讲状态;否则,判为双讲状态。这里所提取的声纹特征矢量可以是MFCC型特征参数,也可以是能有效表征和鉴别输入信号的任何其它类型的特征参数。所采用的“模式匹配”技术可以是特征矢量间的距离匹配技术,也可以是特征矢量间的其它“相似度”匹配技术。
上述实施例中,通过将第一输入信号的第一声纹特征与第二输入信号的第二声纹特征进行比较,在第一声纹特征与第二声纹特征相近时,认为第一输入信号与第二输入信号中均包含远端信号,且第二输入信号中不包含近端语音信号,因此可以认为并未发生双讲,否则认为发生双讲。
由于语音信号是非平稳信号,表现在时域或频域上为非连续信号。因此并 不需要一直检测第一输入信号的第一声纹特征,或第二输入信号的第二声纹特征,可以先检测第一输入信号或第二输入信号中是否有语音信号,如果存在语音信号,则提取第一输入信号或第二输入信号的声纹特征。下面通过具体的实施例来详细描述。
如图6所示,本公开的一些实施例提供的一种近端语音信号检测方法,包括:
步骤601:接收第一输入信号以及第二输入信号,其中,所述第一输入信号为移动终端接收到的远端信号,所述第二输入信号为所述移动终端接收到的近端信号;
步骤602:检测所述第一输入信号是否大于第二门限值,以及检测所述第二输入信号是否大于第三门限值;以及
步骤603:若所述第一输入信号大于所述第二门限值,且所述第二输入信号大于所述第三门限值,则提取所述第一输入信号的第一声纹特征,以及提取所述第二输入信号的第二声纹特征,确定所述第一声纹特征与所述第二声纹特征之间的距离,并根据所述距离确定所述第二输入信号中是否存在近端语音信号。
本公开的一些实施例中的移动终端可以为手机、平板电脑、会议电话等设备。
在步骤601中接收到的第一输入信号为远端信号。远端信号是经过编码、调制,并需要被扬声器等设备播放的信号。
第二输入信号,即近端信号,是由麦克风等音频接收传感器接收到的信号,可能包括远端信号经过回声路径形成的声学回波信号、环境噪声信号以及近端语音信号中的一种或多种组合,第二输入信号中的声学回波信号是需要消除的信号。
在步骤602中,分别检测第一输入信号以及第二输入信号中是否具有语音信号特征的信号,检测的方法有多种,可以通过语音活性检测算法进行检测,也可以通过其他方法进行检测,本公开的一些实施例对此并不限定。
在步骤602中,第二门限值可以是预设的信号能量与噪声能量比的短时能量差,当检测到第一输入信号的短时能量差高于第二门限值时,确定第一输入信号为语音信号。对应的,第三门限值可以是预设的信号能量与噪声能量比的短时能量差
在步骤602中,由于第二输入信号中包含由远端信号经过回声路径形成的声学回波信号时,会产生一定的延时,导致与第一输入信号之间不同步,第二输入信号相对于第一输入信号有一定的滞后。如果不对第一输入信号进行延时处理,直接检测第一输入信号是否大于所述第二门限值,那么需要将第一输入信号的检测结果与第二输入信号在经过延时后的检测结果相比较;如果对第一输入信号进行延时处理,那么需要将第一输入信号的检测结果与同一时间点第二输入信号的检测结果相比较。
综上所述,检测第一输入信号在第一时间点是否大于所述第二门限值,以及检测第二输入信号在第二时间点是否大于第三门限值,其中,第二时间点为第一时间点经过延时后的时间点,延时的时间长度可以根据实际情况确定。由上面的描述可知,延时的时间长度的取值可以分为下面两种情况:
第一种,不对第一输入信号进行延时处理,此时延时的时间长度大于0,即第二时间点为第一时间点之后的时间点;延时的时间长度的具体取值可以根据远端信号在回波路径中的延时确定;
第二种,对第一输入信号进行延时处理,此时延时的时间长度等于0,即第二时间点与第一时间点重合。
最后,在步骤603中,对第一输入信号和第二输入信号的检测结果可以分为以下三种情况:
一、若第二输入信号小于第三门限值,则确定第二输入信号在第二时间点不存在近端语音信号;
二、若第一输入信号小于第二门限值,且第二输入信号大于第三门限值,则确定第二输入信号在第二时间点存在近端语音信号。
三、若第一输入信号大于第二门限值,且第二输入信号大于第三门限值, 则提取第一输入信号的第一声纹特征,以及提取第二输入信号的第二声纹特征,确定第一声纹特征与第二声纹特征的距离,根据距离确定所述第二输入信号中是否存在近端语音信号。
第三种情况中,若第一声纹特征与第二声纹特征的距离小于第四门限值,则确定第二输入信号在第二时间点不存在近端语音信号,否则,确定第二输入信号在所述第二时间点存在近端语音信号。其中,这里的“第四门限值”与图2所示流程中的“第一门限值”取值可以相同也可以不同。
具体如何提取第一输入信号的第一声纹特征,以及提取第二输入信号的第二声纹特征,可以参考前一实施例的描述,在此不再赘述。
当确定第二输入信号中存在近端语音信号之后,向移动终端的自适应滤波器发送指示信息,所述指示信息用于指示所述自适应滤波器暂停更新滤波器系数。
根据以上图6所示流程的描述,图7示出了两种具体应用场景的示意图。
图7给出了基于VAD并利用麦克风输出信号y(n)和远端输入信号x(n)来进行双讲检测的实施例。如图所示,对远端输入信号x(n)进行VAD监测,如果有语音信号,则对信号x(n)提取声纹特征矢量VPx,否则,不作处理。对下行链路中的麦克风输出信号y(n)进行VAD监测,如果有语音信号,则对信号y(n)提取声纹特征矢量VPy,否则,不作处理。在声纹特征矢量VPx可使用时开始等待直到声纹特征矢量VPy可使用时即刻进行模式匹配处理。具体如下:
为了方便,标记下行链路的VAD在第t个时刻的值为DL_VAD(t),上行链路的VAD在第t个时刻的值为UL_VAD(t),如果DL_VAD(t)=0并且UL_VAD(t)=1时,则判定为双讲;如果DL_VAD(t)=0并且UL_VAD(t)=0时,则判定为单讲;如果DL_VAD(t)=1并且UL_VAD(t+t0)=1(这里t0>0)时,则按声纹识别技术判决是否为双讲。这里所提取的声纹特征矢量可以是MFCC型特征参数,也可以是能有效表征和鉴别输入信号的任何其它类型的特征参数。所采用的“模式匹配”技术可以是特征矢量间的距离匹配技术,也可以是特征矢量间的其它“相似度”匹配技术。
针对上述方法流程,本公开的一些实施例还提供一种近端语音信号检测装置,该装置的具体内容可以参照上述方法实施,在此不再赘述。
如图8所示,本公开的一些实施例提供了一种近端语音信号检测装置,包括:
接收单元801,用于接收第一输入信号以及第二输入信号,其中,所述第一输入信号为移动终端接收到的远端信号被线性或非线性变换后所得到的信号,所述第二输入信号为所述移动终端接收到的近端信号;
提取单元802,用于提取所述第一输入信号的第一声纹特征以及所述第二输入信号的第二声纹特征;以及
确定单元803,用于确定所述第一声纹特征与所述第二声纹特征之间的距离,并根据所述距离确定所述第二输入信号中是否存在近端语音信号。
可选地,所述第一输入信号为所述移动终端的自适应滤波器输出的回波估计信号,所述回波估计信号是所述自适应滤波器对所述远端信号进行线性或非线性滤波得到的。
可选地,所述第一输入信号为所述远端信号被线性延时后得到的信号。
可选地,所述确定单元803具体用于:
判断所述距离是否小于第一门限值,若是,则确定所述第二输入信号中不存在近端语音信号,否则,确定所述第二输入信号中存在近端语音信号。
可选地,所述确定单元803还用于:
向所述移动终端的自适应滤波器发送指示信息,所述指示信息用于指示所述自适应滤波器暂停更新滤波器系数。
如图9所示,本公开的一些实施例提供了一种近端语音信号检测装置,包括:
接收单元901,用于接收第一输入信号以及第二输入信号,其中,所述第一输入信号为移动终端接收到的远端信号,所述第二输入信号为所述移动终端接收到的近端信号;
检测单元902,用于检测所述第一输入信号是否大于第二门限值,以及检 测所述第二输入信号是否大于第三门限值;以及
确定单元903,用于在确定所述第一输入信号大于所述第二门限值,且所述第二输入信号大于所述第三门限值时,提取所述第一输入信号的第一声纹特征,以及提取所述第二输入信号的第二声纹特征,确定所述第一声纹特征与所述第二声纹特征之间的距离,并根据所述距离确定所述第二输入信号中是否存在近端语音信号。
可选地,所述检测单元902用于:
检测所述第一输入信号在第一时间点是否大于所述第二门限值,以及检测所述第二输入信号在第二时间点是否大于所述第三门限值,其中,所述第二时间点为所述第一时间点经过延时之后的时间点。
可选地,所述确定单元903具体用于:
若所述距离小于第四门限值,则确定所述第二输入信号在所述第二时间点不存在近端语音信号,否则,确定所述第二输入信号在所述第二时间点存在近端语音信号。
可选地,所述确定单元903还用于:
若所述第二输入信号小于所述第三门限值,则确定所述第二输入信号在所述第二时间点不存在近端语音信号;或者,
若所述第一输入信号小于所述第二门限值,且所述第二输入信号大于所述第三门限值,则确定所述第二输入信号在所述第二时间点存在近端语音信号。
可选地,所述确定单元903还用于:
向所述移动终端的自适应滤波器发送指示信息,其中,所述指示信息用于指示所述自适应滤波器暂停更新滤波器系数。
综上所述,根据本公开的一些实施例提供的方法及装置,本公开的一些实施例中通过提取远端语音信号的第一声纹特征,以及音频接收设备的输出信号中的第二声纹特征,通过对比第一声纹特征与第二声纹特征确定是否发生双讲。通过本公开的一些实施例提供的方法,避免了现有技术中以假设声学回波路径中非线性失真很小而忽略不计以及假设环境噪声平稳为前提条件去检测是否 发生双讲时产生的误判等情况,从而更准确的实现双讲检测。
本领域内的技术人员应明白,本公开的一些实施例可提供为方法、系统、或计算机程序产品。因此,本公开可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本公开可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器和光学存储器等)上实施的计算机程序产品的形式。
本公开是参照根据本公开的一些实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。
这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品,该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。
这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上,使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。
显然,本领域的技术人员可以对本公开进行各种改动和变型而不脱离本公开的精神和范围。这样,倘若本公开的这些修改和变型属于本公开的权利要求及其等同技术的范围之内,则本公开也意图包含这些改动和变型在内。

Claims (20)

  1. 一种近端语音信号检测方法,包括:
    接收第一输入信号以及第二输入信号,其中,所述第一输入信号为移动终端接收到的远端信号被线性或非线性变换后所得到的信号,所述第二输入信号为所述移动终端接收到的近端信号;
    提取所述第一输入信号的第一声纹特征以及所述第二输入信号的第二声纹特征;
    确定所述第一声纹特征与所述第二声纹特征之间的距离;以及
    根据所述距离确定所述第二输入信号中是否存在近端语音信号。
  2. 根据权利要求1所述的方法,其中,所述第一输入信号为所述移动终端的自适应滤波器输出的回波估计信号,其中,所述回波估计信号是所述自适应滤波器对所述远端信号进行线性或非线性滤波得到的。
  3. 根据权利要求1所述的方法,其中,所述第一输入信号为所述远端信号被线性延时后得到的信号。
  4. 根据权利要求1所述的方法,其中,所述根据所述距离确定所述第二输入信号中是否存在近端语音信号,包括:
    判断所述距离是否小于第一门限值,若是,则确定所述第二输入信号中不存在近端语音信号,否则,确定所述第二输入信号中存在近端语音信号。
  5. 根据权利要求1至4中任一项所述的方法,其中,所述确定所述第二输入信号中存在近端语音信号之后,还包括:
    向所述移动终端的自适应滤波器发送指示信息,其中,所述指示信息用于指示所述自适应滤波器暂停更新滤波器系数。
  6. 一种近端语音信号检测方法,包括:
    接收第一输入信号以及第二输入信号,其中,所述第一输入信号为移动终端接收到的远端信号,所述第二输入信号为所述移动终端接收到的近端信号;以及
    检测所述第一输入信号是否大于第二门限值,以及检测所述第二输入信号是否大于第三门限值;
    若所述第一输入信号大于所述第二门限值,且所述第二输入信号大于所述第三门限值,则提取所述第一输入信号的第一声纹特征,以及提取所述第二输入信号的第二声纹特征,确定所述第一声纹特征与所述第二声纹特征之间的距离,并根据所述距离确定所述第二输入信号中是否存在近端语音信号。
  7. 根据权利要求6所述的方法,其中,所述检测所述第一输入信号是否大于第二门限值,以及检测所述第二输入信号是否大于第三门限值,包括:
    检测所述第一输入信号在第一时间点是否大于所述第二门限值,以及检测所述第二输入信号在第二时间点是否大于所述第三门限值,其中,所述第二时间点为所述第一时间点经过延时后的时间点。
  8. 根据权利要求7所述的方法,其中,所述根据所述距离确定所述第二输入信号中是否存在近端语音信号,包括:
    若所述距离小于第四门限值,则确定所述第二输入信号在所述第二时间点不存在近端语音信号,否则,确定所述第二输入信号在所述第二时间点存在近端语音信号。
  9. 根据权利要求7所述的方法,还包括:
    若所述第二输入信号小于所述第三门限值,则确定所述第二输入信号在所述第二时间点不存在近端语音信号;或者,
    若所述第一输入信号小于所述第二门限值,且所述第二输入信号大于所述第三门限值,则确定所述第二输入信号在所述第二时间点存在近端语音信号。
  10. 根据权利要求6至9任一项所述的方法,其中,确定所述第二输入信号中存在近端语音信号之后,还包括:
    向所述移动终端的自适应滤波器发送指示信息,其中,所述指示信息用于指示所述自适应滤波器暂停更新滤波器系数。
  11. 一种近端语音信号检测装置,包括:
    接收单元,用于接收第一输入信号以及第二输入信号,其中,所述第一输 入信号为移动终端接收到的远端信号被线性或非线性变换后所得到的信号,所述第二输入信号为所述移动终端接收到的近端信号;
    提取单元,用于提取所述第一输入信号的第一声纹特征以及所述第二输入信号的第二声纹特征;以及
    确定单元,用于确定所述第一声纹特征与所述第二声纹特征之间的距离,并根据所述距离确定所述第二输入信号中是否存在近端语音信号。
  12. 根据权利要求11所述的装置,其中,所述第一输入信号为所述移动终端的自适应滤波器输出的回波估计信号,所述回波估计信号是所述自适应滤波器对所述远端信号进行线性或非线性滤波得到的。
  13. 根据权利要求11所述的装置,其中,所述第一输入信号为所述远端信号被线性延时后得到的信号。
  14. 根据权利要求11所述的装置,其中,所述确定单元具体用于:
    判断所述距离是否小于第一门限值,若是,则确定所述第二输入信号中不存在近端语音信号,否则,确定所述第二输入信号中存在近端语音信号。
  15. 根据权利要求11至14中任一项所述的装置,其中,所述确定单元还用于:
    向所述移动终端的自适应滤波器发送指示信息,其中,所述指示信息用于指示所述自适应滤波器暂停更新滤波器系数。
  16. 一种近端语音信号检测装置,包括:
    接收单元,用于接收第一输入信号以及第二输入信号,其中,所述第一输入信号为移动终端接收到的远端信号,所述第二输入信号为所述移动终端接收到的近端信号;
    检测单元,用于检测所述第一输入信号是否大于第二门限值以及检测所述第二输入信号是否大于第三门限值;以及
    确定单元,用于在确定所述第一输入信号大于所述第二门限值且所述第二输入信号大于所述第三门限值时,提取所述第一输入信号的第一声纹特征,以及提取所述第二输入信号的第二声纹特征,确定所述第一声纹特征与所述第二 声纹特征之间的距离,并根据所述距离确定所述第二输入信号中是否存在近端语音信号。
  17. 根据权利要求16所述的装置,其中,所述检测单元用于:
    检测所述第一输入信号在第一时间点是否大于所述第二门限值,以及检测所述第二输入信号在第二时间点是否大于所述第三门限值,其中,所述第二时间点为所述第一时间点经过延时后的时间点。
  18. 根据权利要求17所述的装置,其中,所述确定单元具体用于:
    若所述距离小于第四门限值,则确定所述第二输入信号在所述第二时间点不存在近端语音信号,否则,确定所述第二输入信号在所述第二时间点存在近端语音信号。
  19. 根据权利要求17所述的装置,其中,所述确定单元还用于:
    若所述第二输入信号小于所述第三门限值,则确定所述第二输入信号在所述第二时间点不存在近端语音信号;或者,
    若所述第一输入信号小于所述第二门限值,且所述第二输入信号大于所述第三门限值,则确定所述第二输入信号在所述第二时间点存在近端语音信号。
  20. 根据权利要求16至19任一项所述的装置,其中,所述确定单元还用于:
    向所述移动终端的自适应滤波器发送指示信息,其中,所述指示信息用于指示所述自适应滤波器暂停更新滤波器系数。
PCT/CN2016/070253 2015-03-09 2016-01-06 一种近端语音信号检测方法及装置 WO2016141773A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510102968.X 2015-03-09
CN201510102968.XA CN106033673B (zh) 2015-03-09 2015-03-09 一种近端语音信号检测方法及装置

Publications (1)

Publication Number Publication Date
WO2016141773A1 true WO2016141773A1 (zh) 2016-09-15

Family

ID=56879966

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/070253 WO2016141773A1 (zh) 2015-03-09 2016-01-06 一种近端语音信号检测方法及装置

Country Status (3)

Country Link
CN (1) CN106033673B (zh)
TW (1) TWI594234B (zh)
WO (1) WO2016141773A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109994116A (zh) * 2019-03-11 2019-07-09 南京邮电大学 一种基于会议场景小样本条件下的声纹准确识别方法
CN112259112A (zh) * 2020-09-28 2021-01-22 上海声瀚信息科技有限公司 一种结合声纹识别和深度学习的回声消除方法
CN114724572A (zh) * 2022-03-31 2022-07-08 杭州网易智企科技有限公司 确定回声延时的方法和装置

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109215672B (zh) * 2017-07-05 2021-11-16 苏州谦问万答吧教育科技有限公司 一种声音信息的处理方法、装置及设备
CN107610713B (zh) 2017-10-23 2022-02-01 科大讯飞股份有限公司 基于时延估计的回声消除方法及装置
CN113949977B (zh) * 2020-07-17 2023-08-11 通用微(深圳)科技有限公司 声音采集装置、声音处理设备及方法、装置、存储介质
CN117854517A (zh) * 2024-02-05 2024-04-09 南京龙垣信息科技有限公司 车载多人实时智能语音交互系统

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0640953A1 (en) * 1993-08-25 1995-03-01 Canon Kabushiki Kaisha Audio signal processing method and apparatus
CN1584977A (zh) * 2004-05-31 2005-02-23 中兴通讯股份有限公司 一种回声抑制器中近端话音检测的实现方法
CN102137194A (zh) * 2010-01-21 2011-07-27 华为终端有限公司 一种通话检测方法及装置
CN103337242A (zh) * 2013-05-29 2013-10-02 华为技术有限公司 一种语音控制方法和控制设备
CN103905656A (zh) * 2012-12-27 2014-07-02 联芯科技有限公司 残留回声的检测方法及装置

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2342832B (en) * 1998-08-04 2001-03-21 Motorola Inc Method and device for detecting near-end voice
US7558729B1 (en) * 2004-07-16 2009-07-07 Mindspeed Technologies, Inc. Music detection for enhancing echo cancellation and speech coding
JP5032669B2 (ja) * 2007-11-29 2012-09-26 テレフオンアクチーボラゲット エル エム エリクソン(パブル) 音声信号のエコーキャンセルのための方法及び装置

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0640953A1 (en) * 1993-08-25 1995-03-01 Canon Kabushiki Kaisha Audio signal processing method and apparatus
CN1584977A (zh) * 2004-05-31 2005-02-23 中兴通讯股份有限公司 一种回声抑制器中近端话音检测的实现方法
CN102137194A (zh) * 2010-01-21 2011-07-27 华为终端有限公司 一种通话检测方法及装置
CN103905656A (zh) * 2012-12-27 2014-07-02 联芯科技有限公司 残留回声的检测方法及装置
CN103337242A (zh) * 2013-05-29 2013-10-02 华为技术有限公司 一种语音控制方法和控制设备

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109994116A (zh) * 2019-03-11 2019-07-09 南京邮电大学 一种基于会议场景小样本条件下的声纹准确识别方法
CN109994116B (zh) * 2019-03-11 2021-01-19 南京邮电大学 一种基于会议场景小样本条件下的声纹准确识别方法
CN112259112A (zh) * 2020-09-28 2021-01-22 上海声瀚信息科技有限公司 一种结合声纹识别和深度学习的回声消除方法
CN114724572A (zh) * 2022-03-31 2022-07-08 杭州网易智企科技有限公司 确定回声延时的方法和装置

Also Published As

Publication number Publication date
CN106033673A (zh) 2016-10-19
CN106033673B (zh) 2019-09-17
TW201633292A (zh) 2016-09-16
TWI594234B (zh) 2017-08-01

Similar Documents

Publication Publication Date Title
WO2016141773A1 (zh) 一种近端语音信号检测方法及装置
US10535362B2 (en) Speech enhancement for an electronic device
US10475471B2 (en) Detection of acoustic impulse events in voice applications using a neural network
CN110770827B (zh) 基于相关性的近场检测器
EP2954513B1 (en) Ambient noise root mean square (rms) detector
EP2643834B1 (en) Device and method for producing an audio signal
US8898058B2 (en) Systems, methods, and apparatus for voice activity detection
JP6291501B2 (ja) 音響エコー除去のためのシステムおよび方法
CN106486135B (zh) 近端语音检测器、语音系统、对语音进行分类的方法
US20100278351A1 (en) Methods and systems for reducing acoustic echoes in multichannel communication systems by reducing the dimensionality of the space of impulse resopnses
CN104050971A (zh) 声学回声减轻装置和方法、音频处理装置和语音通信终端
GB2554955A (en) Detection of acoustic impulse events in voice applications
KR20090050372A (ko) 혼합 사운드로부터 잡음을 제거하는 방법 및 장치
US20190066654A1 (en) Adaptive suppression for removing nuisance audio
Hamidia et al. A new robust double-talk detector based on the Stockwell transform for acoustic echo cancellation
CN103903612A (zh) 一种实时语音识别数字的方法
US20200286501A1 (en) Apparatus and a method for signal enhancement
US20140341386A1 (en) Noise reduction
CN103905656B (zh) 残留回声的检测方法及装置
Lei et al. Deep neural network based regression approach for acoustic echo cancellation
CN110364175B (zh) 语音增强方法及系统、通话设备
WO2021190274A1 (zh) 回声声场状态确定方法及装置、存储介质、终端
WO2020015546A1 (zh) 一种远场语音识别方法、语音识别模型训练方法和服务器
Kamarudin et al. Acoustic echo cancellation using adaptive filtering algorithms for Quranic accents (Qiraat) identification
Ayrapetian et al. Asynchronous acoustic echo cancellation over wireless channels

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16761000

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16761000

Country of ref document: EP

Kind code of ref document: A1