WO2020181766A1 - Voice signal processing method, apparatus, device and readable storage medium - Google Patents

Voice signal processing method, apparatus, device and readable storage medium

Info

Publication number
WO2020181766A1
WO2020181766A1 PCT/CN2019/111319 CN2019111319W WO2020181766A1 WO 2020181766 A1 WO2020181766 A1 WO 2020181766A1 CN 2019111319 W CN2019111319 W CN 2019111319W WO 2020181766 A1 WO2020181766 A1 WO 2020181766A1
Authority
WO
WIPO (PCT)
Prior art keywords
signal
target
state
return loss
voice signal
Prior art date
Application number
PCT/CN2019/111319
Other languages
English (en)
French (fr)
Inventor
朱赛男
浦宏杰
鄢仁祥
曹李军
Original Assignee
苏州科达科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州科达科技股份有限公司 filed Critical 苏州科达科技股份有限公司
Priority to US17/438,640 priority Critical patent/US11869528B2/en
Priority to EP19918901.0A priority patent/EP3929919A4/en
Publication of WO2020181766A1 publication Critical patent/WO2020181766A1/zh

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0216Noise filtering characterised by the method used for estimating noise
    • G10L21/0232Processing in the frequency domain
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0264Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
    • G10L21/0324Details of processing therefor
    • G10L21/034Automatic adjustment
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/06Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being correlation coefficients
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/21Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M3/00Automatic or semi-automatic exchanges
    • H04M3/42Systems providing special services or facilities to subscribers
    • H04M3/56Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
    • H04M3/568Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M9/00Arrangements for interconnection not involving centralised switching
    • H04M9/08Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M9/00Arrangements for interconnection not involving centralised switching
    • H04M9/08Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic
    • H04M9/082Two-way loud-speaking telephone systems with means for conditioning the signal, e.g. for suppressing echoes for one or both directions of traffic using echo cancellers
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L2021/02082Noise filtering the noise being echo, reverberation of the speech

Definitions

  • This application relates to the field of communication technology, and in particular to a voice signal processing method, device, equipment and readable storage medium.
  • In VOIP applications (such as software video conferencing and VOIP conference calls), echo cancellation plays a vital role, and its performance directly affects the quality of users' conference calls.
  • In theory, the echo is formed by the target sound after passing through the impulse response of the room channel; therefore, the echo channel can be modeled by an adaptive filter to achieve echo cancellation.
  • In actual conference-call scenarios, however, there are often problems such as limited convergence of the adaptive filter, nonlinear distortion of the loudspeaker, environmental background noise, and delay jitter.
  • Current echo cancellation technology generally cascades a residual echo suppressor after a linear echo canceller to achieve echo cancellation.
  • The most common way to estimate the residual echo is to use the echo estimated by the adaptive filter together with a leakage factor, and then to suppress the residual echo on that basis.
  • This method yields a good residual echo estimate only when the acoustic environment is good and the filter has converged; in real complex environments the filter rarely converges in a true sense, the estimated echo tends to be too small, the residual echo is underestimated, and slight echo leakage appears.
  • In addition, to keep the sound volume stable, a dynamic range controller is often cascaded after the residual echo suppressor.
  • A commonly used dynamic range controller, however, judges the presence or absence of voice only from the sound amplitude, and on that basis tracks the sound envelope and adjusts the volume gain.
  • In a single-talk case with poor upstream echo cancellation and echo leakage, the dynamic range controller still decides that voice is present and tracks the envelope incorrectly, so the volume gain adjustment rapidly increases or decreases; this enlarges the echo and makes subsequent normal local sound suddenly louder or quieter, causing a series of low-sound-quality problems.
  • the purpose of this application is to provide a voice signal processing method, device, equipment and readable storage medium to improve the quality of voice signal processing.
  • a voice signal processing method including:
  • the state coefficient is updated by using the statistical information, and the volume control of the target voice signal is performed when the state coefficient is updated.
  • the derivative signal includes a residual signal, an estimated echo signal, a far-end reference signal, and a near-end signal; using the derivative signal to calculate the initial return loss corresponding to each signal frame in the target voice signal includes:
  • the derived signals are converted into frequency domain signals, and the autocorrelation power spectra are calculated; wherein the frequency domain signals include the residual frequency domain signal, the estimated echo frequency domain signal, the far-end reference frequency domain signal and the near-end frequency domain signal, and the signal autocorrelation power spectra include the residual signal autocorrelation power spectrum, the estimated echo signal autocorrelation power spectrum and the far-end reference signal autocorrelation power spectrum;
  • said using said frequency domain signal and said autocorrelation power spectrum to calculate said initial return loss includes:
  • the initial return loss corresponding to each signal frame in the target voice signal is determined.
  • calculating the near-to-far-end coherence value of the target voice signal by using the derived signals includes:
  • the near-to-far-end coherence value is calculated using the frequency domain signal.
  • determining whether the target return loss matches the single talk echo state includes:
  • updating the state coefficient with the statistical information includes:
  • the state coefficient is recalculated using the median value, and the state coefficient is updated using the calculation result.
  • performing volume control on the target voice signal when updating the state coefficient includes:
  • a gain value matching the decibel calculation result is determined, and the gain value is used to control the volume of the target voice signal.
  • a voice signal processing device includes:
  • a signal acquisition module for acquiring state coefficients, target voice signals to be processed, and derived signals of the target voice signals
  • the target return loss acquisition module is configured to use the derivative signal to calculate the initial return loss corresponding to each signal frame in the target voice signal, and use the state coefficient to adjust the initial return loss to obtain the target Return loss
  • a near-to-far-end coherence value calculation module configured to use the derived signals to calculate a near-to-far-end coherence value of the target voice signal;
  • the judgment module is used to judge whether the target return loss matches the single talk echo state
  • the near-to-far coherence value recording module is configured to record the near-to-far coherence value in the state statistical data when the target return loss matches the state of the single talk echo, and update the statistical information;
  • the volume control module is configured to update the state coefficient by using the statistical information, and perform volume control on the target voice signal when the state coefficient is updated.
  • a voice signal processing device including:
  • Memory used to store computer programs
  • the processor is used to implement the steps of the voice signal processing method when the computer program is executed.
  • Applying the method provided by the embodiments of this application: a state coefficient, a target voice signal to be processed and derived signals of the target voice signal are acquired; the derived signals are used to calculate the initial return loss corresponding to each signal frame in the target voice signal, and the state coefficient is used to adjust the initial return loss to obtain the target return loss; the derived signals are used to calculate the near-to-far-end coherence value of the target voice signal; whether the target return loss matches the single-talk echo state is judged; if so, the near-to-far-end coherence value is recorded in the state statistical data and the statistical information is updated; the statistical information is used to update the state coefficient, and volume control is performed on the target voice signal when the state coefficient is updated.
  • After the state coefficient, the target voice signal to be processed and the derived signals of the target voice signal are obtained, the derived signals can be used to calculate the initial return loss corresponding to each signal frame in the target voice signal. The state coefficient is then used to adjust the initial return loss to obtain the target return loss.
  • This avoids the situation in single-talk echo where voice is wrongly deemed present and the envelope is wrongly tracked, which would make the volume gain adjustment rapidly increase or decrease and thereby enlarge the echo and make subsequent normal local voice suddenly louder or quieter, causing a series of low-quality sound problems.
  • After the target return loss is determined and the near-to-far-end coherence value is calculated, it can be judged whether the target return loss matches the single-talk echo state; if it matches, the coherence value is recorded in the state statistical data and the statistical information is updated. The statistical information is then used to update the state coefficient, and volume control is performed on the target voice when the state coefficient is updated.
  • Compared with the prior art, the method has the advantages of high environmental applicability, strong echo suppression capability, and high sound quality.
  • Environmental adaptability is high: no matter how complex and changeable objective factors such as reverberation, noise and nonlinear distortion in the conference environment are, updating the state coefficient from the state statistical data is not limited by theoretical values and adaptively reflects the similarity of the near-end and far-end signals under single talk in the current environment. Sound quality is high: performing volume control on the target voice signal when the state coefficient is updated avoids wrong tracking of the loudness envelope and wrong gain adjustment in single-talk cases where the echo has not been removed cleanly (slight or large residual echo), which improves sound quality. In other words, after processing the target voice signal, the method provided by the embodiments of this application improves its signal quality, ensures the sound quality when the target voice signal is played, and can further improve the user experience.
  • the embodiments of the present application also provide voice signal processing apparatuses, equipment, and readable storage media corresponding to the foregoing voice signal processing methods, which have the foregoing technical effects, and will not be repeated here.
  • FIG. 1 is an implementation flowchart of a voice signal processing method in an embodiment of this application
  • FIG. 2 is a technical block diagram of a voice signal processing method in an embodiment of this application.
  • FIG. 3 is a specific implementation flowchart of a voice signal processing method in an embodiment of this application.
  • FIG. 4 is a schematic structural diagram of a voice signal processing device in an embodiment of the application.
  • FIG. 5 is a schematic structural diagram of a voice signal processing device in an embodiment of the application.
  • FIG. 6 is a schematic diagram of a specific structure of a voice signal processing device in an embodiment of the application.
  • FIG. 1 is a flowchart of a voice signal processing method in an embodiment of this application. The method includes the following steps:
  • S101 Obtain state coefficients, a target voice signal to be processed, and a derivative signal of the target voice signal.
  • The target speech signal to be processed may be a signal obtained by performing echo cancellation on the original speech signal with an adaptive filter.
  • The derived signals of the target speech may include a residual signal, an estimated echo signal, a far-end reference signal and a near-end signal.
  • e(n) in this document represents the residual signal
  • y(n) represents the estimated echo signal
  • x(n) represents the far-end reference signal
  • d(n) represents the near-end signal.
  • The state coefficient is the coefficient characterizing the state of the target voice signal; it can be obtained from the preceding voice signal processing and is adjusted according to the specific signal state of the processed target voice signal. The adjustment process is detailed in the state-coefficient update process described below.
  • S102 Calculate the initial return loss corresponding to each signal frame in the target voice signal by using the derivative signal, and use the state coefficient to adjust the initial return loss to obtain the target return loss.
  • After the state coefficient, the target voice signal and the derived signals of the target voice signal are obtained, the derived signals can be used to calculate the initial return loss corresponding to each signal frame in the target voice signal. To improve the accuracy of the return loss, the state coefficient can be used to adjust the initial return loss and obtain a more accurate target return loss.
  • the process of calculating the initial return loss corresponding to each signal frame in the target voice signal by using the derived signal includes:
  • Step 1 Convert the derived signal to frequency domain signal and calculate the autocorrelation power spectrum; among them, the frequency domain signal includes residual frequency domain signal, estimated echo frequency domain signal, far-end reference frequency domain signal and near-end frequency domain signal, Signal autocorrelation power spectrum includes residual signal autocorrelation power spectrum, estimated echo signal autocorrelation power spectrum and far-end reference signal autocorrelation power spectrum;
  • Step 2 Combine the frequency domain signal and the autocorrelation power spectrum to calculate the initial return loss.
  • the time-domain derivative signal is converted into a frequency-domain signal, which can be realized by Fourier transform.
  • the derived signal includes the residual signal, the estimated echo signal, the far-end reference signal, and the near-end signal
  • The frequency domain signals obtained after conversion correspondingly include the residual frequency domain signal, the estimated echo frequency domain signal, the far-end reference frequency domain signal and the near-end frequency domain signal.
  • For ease of description, E(w) herein denotes the residual frequency domain signal, Y(w) the estimated echo frequency domain signal, X(w) the far-end reference frequency domain signal, and D(w) the near-end frequency domain signal.
  • After the frequency domain signals are obtained, they can be used to calculate the autocorrelation power spectra of the constituent signals of the target speech signal. Specifically, the residual signal autocorrelation power spectrum S_EE(w), the estimated echo autocorrelation power spectrum S_YY(w), and the far-end reference signal autocorrelation power spectrum S_XX(w) can be calculated.
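  • As a rough, non-limiting illustration of this preprocessing step, the following Python sketch (hypothetical helper names; numpy assumed) converts one frame of each derived signal to the frequency domain and computes the autocorrelation power spectra S_EE(w), S_YY(w) and S_XX(w).

```python
import numpy as np

def to_frequency_domain(frame: np.ndarray) -> np.ndarray:
    """Convert one time-domain frame to a frequency-domain signal via the FFT."""
    return np.fft.rfft(frame)

def autocorrelation_power_spectrum(spec: np.ndarray) -> np.ndarray:
    """Per-bin autocorrelation power spectrum, e.g. S_EE(w) = |E(w)|^2."""
    return (spec * np.conj(spec)).real

def derive_spectra(e, y, x, d):
    """e, y, x, d: one frame each of e(n), y(n), x(n), d(n)."""
    E, Y, X, D = (to_frequency_domain(s) for s in (e, y, x, d))
    S_EE = autocorrelation_power_spectrum(E)
    S_YY = autocorrelation_power_spectrum(Y)
    S_XX = autocorrelation_power_spectrum(X)
    return (E, Y, X, D), (S_EE, S_YY, S_XX)
```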
  • the specific realization process of calculating the initial return loss includes:
  • Step 1 Perform unary linear regression analysis on the residual signal and estimated echo signal to obtain the leakage coefficient of each signal frame in the target speech signal;
  • Step 2 Determine the residual echo estimation value according to the corresponding relationship between the preset leakage coefficient and the signal autocorrelation power spectrum, combined with the signal autocorrelation power spectrum;
  • Step 3 Use the residual echo suppression function to suppress the residual echo to obtain a residual echo suppression signal
  • Step 4 Use the residual echo suppression signal and the near-end signal to determine the initial return loss corresponding to each signal frame in the target voice signal.
  • Specifically, R_EY(l,w) = (1-β)·R_EY(l-1,w) + β·E(w)·Y(w) and R_YY(l,w) = (1-β)·R_YY(l-1,w) + β·S_YY(w), with initial values R_EY(0,w) = 1 and R_YY(0,w) = 1.
  • β is the smoothing coefficient (its value lies in the interval 0–1; for example, 0.93 may be taken).
  • The residual echo estimate REcho(w) is then calculated according to the preset correspondence between the leakage coefficient and the signal autocorrelation power spectrum: when the leakage factor Leak is greater than 0.5, REcho(w) = S_YY(w); otherwise, REcho(w) = 2·S_YY(w).
  • Preferably, when the system tends to be stable, the state coefficient can be used for fine tuning, further giving the residual echo estimate REcho(w) = REcho(w)·Ralpha(w), where Ralpha(w) is the state coefficient.
  • The residual echo suppression function is a suppression gain computed from S_EE(w), S_YY(w) and REcho(w) (its exact expression appears as an equation image in the original publication), where S_EE(w) is the autocorrelation power spectrum of the residual signal, S_YY(w) is the autocorrelation power spectrum of the estimated echo signal, and REcho(w) is the residual echo estimate.
  • The residual echo suppression function is applied for echo suppression to obtain the signal e2(n) output by the residual echo suppressor.
  • The target return loss ERLE corresponding to each signal frame is then determined from the residual echo suppression signal e2(n) and the near-end signal d(n), accumulated over the frame, where N is the frame length.
  • It should be noted that the above way of calculating the return loss is only one optional implementation listed in this application.
  • A person skilled in the art may calculate it in a related or similar manner, for example by modifying or adding weight coefficients for the relevant parameters, adding an error offset, and so on.
  • The same applies to the calculations of the coherence value, the transient envelope value, the power and the like described in the embodiments of this application; they are merely optional calculation methods and are not unduly limiting.
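  • As a rough illustration of the calculation described above (which, as noted, is only one optional implementation), the following Python sketch puts the pieces together. It is one possible reading of the text: the patent's exact suppression function and return-loss expression are given only as equation images, so the Wiener-style gain and the decibel energy-ratio form of ERLE used here are assumptions.

```python
import numpy as np

BETA = 0.93  # smoothing coefficient beta suggested in the text

def update_leak(R_EY, R_YY, E, Y, S_YY, beta=BETA):
    """Recursive averages and leakage coefficient Leak(l) (regression of E on Y)."""
    R_EY = (1 - beta) * R_EY + beta * (E * np.conj(Y)).real  # practical reading of E(w)*Y(w)
    R_YY = (1 - beta) * R_YY + beta * S_YY
    leak = float(np.sum(R_EY) / max(np.sum(R_YY), 1e-12))
    return R_EY, R_YY, leak

def residual_echo_estimate(leak, S_YY, Ralpha=None):
    """REcho(w) per the text; optional fine tuning by the state coefficient Ralpha(w)."""
    REcho = S_YY if leak > 0.5 else 2.0 * S_YY
    if Ralpha is not None:
        REcho = REcho * Ralpha
    return REcho

def suppress_and_erle(E, d_frame, S_EE, REcho):
    """Apply an assumed Wiener-style suppression gain and compute an assumed ERLE in dB."""
    gain = S_EE / (S_EE + REcho + 1e-12)          # placeholder for the patent's gain function
    e2 = np.fft.irfft(gain * E)                   # residual-echo-suppressed frame e2(n)
    erle = 10.0 * np.log10(np.sum(e2 ** 2) / max(np.sum(d_frame ** 2), 1e-12))
    return e2, erle
```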
  • S103 Calculate the near-to-far-end coherence value of the target voice signal by using the derived signals.
  • Specifically, after conversion to the frequency domain, the far-end reference frequency domain signal and the near-end frequency domain signal may be used to calculate the near-to-far-end coherence value of the target voice signal.
  • Preferably, since the main sound source of the target voice signal processed in the embodiments of this application is the human voice captured by a sound sensor, the near-to-far-end coherence value can be calculated from the frequency domain signals within the human voice frequency range, in order to reduce interference from other noise.
  • The human voice frequency range is 300 Hz–3000 Hz; that is, when calculating the near-to-far-end coherence value, the frequency search range is 300 Hz–3000 Hz.
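  • A minimal sketch of this band-limited coherence calculation, assuming numpy, recursively smoothed cross/auto spectra as in the detailed embodiment below, and a magnitude-squared coherence per bin (the exact averaging expression appears as an equation image in the original):

```python
import numpy as np

GCOH = 0.93  # forgetting factor gcoh suggested in the text

def update_cross_spectra(S_XD, S_X, S_D, X, D, gcoh=GCOH):
    """Recursive smoothing of the cross- and auto-spectra of X(w) and D(w)."""
    S_XD = gcoh * S_XD + (1 - gcoh) * X * np.conj(D)   # conj() is a practical reading of X(w)*D(w)
    S_X  = gcoh * S_X  + (1 - gcoh) * (np.abs(X) ** 2)
    S_D  = gcoh * S_D  + (1 - gcoh) * (np.abs(D) ** 2)
    return S_XD, S_X, S_D

def near_far_coherence(S_XD, S_X, S_D, freqs, lo=300.0, hi=3000.0):
    """Average near-to-far-end coherence C_XD over the human-voice band 300-3000 Hz."""
    band = (freqs >= lo) & (freqs <= hi)               # e.g. freqs = np.fft.rfftfreq(N, 1/fs)
    coh = (np.abs(S_XD[band]) ** 2) / (S_X[band] * S_D[band] + 1e-12)
    return float(np.mean(coh))
```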
  • S104 Judge whether the target return loss matches the single-talk echo state.
  • A judgment threshold for deciding whether the target return loss matches the single-talk echo state can be preset, namely the preset threshold T_erle.
  • According to the ITU G.167 standard, a value of -40 dB is recommended for T_erle here.
  • The specific judgment process includes:
  • Step 1 Judge whether the target return loss is less than the preset threshold;
  • Step 2 If yes, determine that the target return loss matches the single-talk echo state;
  • Step 3 If no, determine that the target return loss does not match the single-talk echo state.
  • When the return loss ERLE < T_erle, the state is initially set as the single-talk echo state and the operation of step S105 is executed; otherwise, no processing is required and no operation is performed.
  • S105 Record the near and far-end coherence values in the state statistical data, and update the statistical information.
  • the state statistical data can be stored in the form of a state statistical histogram.
  • the state statistical data can also be stored in common data statistical forms such as tables or sequences.
  • the statistical information may include the statistical amount and statistical time of the recorded near and far-end coherence values.
  • S106 Use the statistical information to update the state coefficient, and perform volume control on the target voice signal when the state coefficient is updated.
  • Since the statistical information is a further statistical result of the state statistical data recording the near-to-far-end coherence values under the single-talk echo state, the state of the target voice signal can be known from it, and the statistical information can therefore be used to update the state coefficient. Specifically, when the state statistical data are stored as a statistical histogram, the median value of the histogram can be obtained once the statistical count in the statistical information is greater than the preset statistical threshold; the state coefficient is then recalculated from this median value and updated with the calculation result. Calculating the state coefficient from the median value, without being limited by a theoretical value, better reflects the state of the target voice signal.
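  • A sketch of how the state statistical histogram and the median-based state coefficient update could be organized. The bin layout, the count threshold T_N value, and the median-to-coefficient mapping are assumptions, since the patent gives the exact state-coefficient formula and its bounds only as equation images.

```python
import numpy as np

class StateStatistics:
    """State statistical histogram of single-talk coherence values C_XD (a sketch)."""
    def __init__(self, n_bins=100, count_threshold=500):
        self.bins = np.linspace(0.0, 1.0, n_bins + 1)
        self.hist = np.zeros(n_bins, dtype=int)
        self.count = 0
        self.count_threshold = count_threshold       # plays the role of T_N

    def record(self, c_xd):
        """Add one coherence value observed in the single-talk echo state."""
        idx = int(np.clip(np.digitize(c_xd, self.bins) - 1, 0, len(self.hist) - 1))
        self.hist[idx] += 1
        self.count += 1

    def median(self):
        """Median value T_XD of the recorded coherence values."""
        cum = np.cumsum(self.hist)
        mid_bin = int(np.searchsorted(cum, cum[-1] / 2.0))
        return 0.5 * (self.bins[mid_bin] + self.bins[mid_bin + 1])

    def state_coefficient(self, c_xd, alpha=0.5):
        """Recompute the state coefficient once enough frames are recorded (assumed ratio form)."""
        if self.count <= self.count_threshold:
            return None
        t_xd = self.median()
        return float(np.clip(alpha * c_xd / max(t_xd, 1e-6), 0.0, 1.0))
```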
  • Specifically, when the state coefficient is updated, the transient envelope value of the target voice signal is calculated; when the transient envelope value is greater than the preset noise threshold, the envelope is updated; the updated envelope is used to calculate a decibel value, obtaining a decibel calculation result; according to the mapping relationship between decibels and gain, a gain value matching the decibel calculation result is determined, and the gain value is used to control the volume of the target voice signal.
  • The transient envelope value is calculated as the average energy of the target voice signal.
  • The envelope is the curve reflecting the change of the signal amplitude, and the value corresponding to each point on the envelope can be regarded as a transient envelope value.
  • After the transient envelope value is obtained, whether to update the envelope can be decided by judging whether it is greater than the preset noise threshold. That is, when the transient envelope value is greater than the preset threshold, speech is deemed present and the envelope can be updated. The decibel calculation is then performed on the updated envelope, and the gain value is finally determined from the decibel value. In this way, gain control can be performed on the target voice signal, i.e. the volume of the voice signal can be controlled.
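  • A minimal sketch of the envelope-update decision described above. The helper names and the allow_update flag (which stands in for the histogram conditions of the detailed embodiment) are assumptions.

```python
import numpy as np

T_NOISE = 0.000001   # preset noise threshold from the text (noise assumed below -60 dB)
FACTOR  = 0.5        # envelope forgetting coefficient suggested in the text

def track_envelope(envelope, frame_e2, allow_update, t_noise=T_NOISE, factor=FACTOR):
    """Update the loudness envelope only when speech is deemed present and the
    single-talk statistics allow it."""
    transient = float(np.mean(frame_e2 ** 2))   # transient envelope value: frame energy average
    if transient > t_noise and allow_update:
        envelope = factor * envelope + (1 - factor) * transient
    return envelope
```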
  • Applying the method provided by the embodiments of this application: a state coefficient, a target voice signal to be processed and derived signals of the target voice signal are acquired; the derived signals are used to calculate the initial return loss corresponding to each signal frame in the target voice signal, and the state coefficient is used to adjust the initial return loss to obtain the target return loss; the derived signals are used to calculate the near-to-far-end coherence value of the target voice signal; whether the target return loss matches the single-talk echo state is judged; if so, the near-to-far-end coherence value is recorded in the state statistical data and the statistical information is updated; the statistical information is used to update the state coefficient, and volume control is performed on the target voice signal when the state coefficient is updated.
  • After the state coefficient, the target voice signal to be processed and the derived signals of the target voice signal are obtained, the derived signals can be used to calculate the initial return loss corresponding to each signal frame in the target voice signal. The state coefficient is then used to adjust the initial return loss to obtain the target return loss.
  • This avoids the situation in single-talk echo where voice is wrongly deemed present and the envelope is wrongly tracked, which would make the volume gain adjustment rapidly increase or decrease and thereby enlarge the echo and make subsequent normal local voice suddenly louder or quieter, causing a series of low-quality sound problems.
  • After the target return loss is determined and the near-to-far-end coherence value is calculated, it can be judged whether the target return loss matches the single-talk echo state; if it matches, the coherence value is recorded in the state statistical data and the statistical information is updated. The statistical information is then used to update the state coefficient, and volume control is performed on the target voice when the state coefficient is updated.
  • Compared with the prior art, the method has the advantages of high environmental applicability, strong echo suppression capability, and high sound quality.
  • Environmental adaptability is high: no matter how complex and changeable objective factors such as reverberation, noise and nonlinear distortion in the conference environment are, updating the state coefficient from the state statistical data is not limited by theoretical values and adaptively reflects the similarity of the near-end and far-end signals under single talk in the current environment. Sound quality is high: performing volume control on the target voice signal when the state coefficient is updated avoids wrong tracking of the loudness envelope and wrong gain adjustment in single-talk cases where the echo has not been removed cleanly (slight or large residual echo), which improves sound quality. In other words, after processing the target voice signal, the method provided by the embodiments of this application improves its signal quality, ensures the sound quality when the target voice signal is played, and can further improve the user experience.
  • FIG. 2 is a technical block diagram of a voice signal processing method in an embodiment of this application.
  • the voice signal processing method provided by this application can be composed of three parts: a residual echo suppressor, a state statistical analyzer, and a dynamic range controller.
  • The state coefficient, as the output of the state statistical analyzer, feeds back to adjust the echo estimation in the residual echo suppressor, and at the same time serves as a judgment basis for the loudness tracking of the dynamic range controller.
  • FIG. 3 is a specific implementation flowchart of a voice signal processing method in an embodiment of the application.
  • the specific implementation process of this method includes:
  • S201 Signal preprocessing: the residual echo suppressor obtains the residual signal e(n) of the adaptive filter, the estimated echo signal y(n), the far-end reference signal x(n) and the near-end signal d(n), and performs a Fourier transform on each of them to obtain the residual frequency domain signal E(w), the estimated echo frequency domain signal Y(w), the far-end reference frequency domain signal X(w) and the near-end frequency domain signal D(w), together with the autocorrelation power spectra S_EE(w), S_YY(w) and S_XX(w).
  • S202 Calculate the leakage coefficient: calculate the leakage coefficient Leak(l) of the residual echo of the current frame (the l-th frame). This value reflects the degree of echo leakage relative to the far-end signal.
  • It is obtained by unary linear regression analysis of the residual signal e(n) and the estimated echo y(n), and can be expressed as the ratio Leak(l) = Σ_w R_EY(l,w) / Σ_w R_YY(l,w), where R_EY(l,w) is the frequency-domain cross-coherence value obtained by recursive averaging of the residual signal and the estimated echo, and R_YY(l,w) is the autocorrelation value obtained by recursive averaging of the estimated echo. Their expressions are:
  • R_EY(l,w) = (1-β)·R_EY(l-1,w) + β·E(w)·Y(w)
  • R_YY(l,w) = (1-β)·R_YY(l-1,w) + β·S_YY(w), with R_EY(0,w) = 1, R_YY(0,w) = 1, and β the smoothing coefficient, taken as 0.93.
  • S203 Calculate the estimated residual echo: when the leakage factor Leak is greater than 0.5, REcho(w) = S_YY(w); otherwise REcho(w) = 2·S_YY(w). When the histogram statistics indicate that the system tends to be stable, the estimate is fine-tuned as REcho(w) = REcho(w)·Ralpha(w),
  • where Ralpha(w) is the state coefficient fed back by the state statistical analyzer.
  • S204 Suppress the residual echo: apply the residual echo suppression function (computed from S_EE(w), S_YY(w) and REcho(w); its exact expression appears as an equation image in the original) to perform echo suppression and obtain the signal e2(n) output by the residual echo suppressor;
  • S205 Calculate the target return loss: compute the target return loss ERLE from e2(n) and the near-end signal d(n) over the frame, where N is the frame length;
  • S206 Calculate the distribution of the near-to-far-end coherence values: the average near-to-far-end coherence C_XD is obtained by averaging the per-bin coherence |S_XD(w)|² / (S_X(w)·S_D(w)) over the bins from startbin to endbin, where the frequency search range corresponding to startbin and endbin is 300 Hz–3000 Hz, the main frequency range of the speaker's speech.
  • The quantities S_XD(w), S_X(w) and S_D(w) are calculated as S_XD(w) = gcoh·S_XD(w) + (1-gcoh)·X(w)·D(w), S_X(w) = gcoh·S_X(w) + (1-gcoh)·X(w)·X(w), and S_D(w) = gcoh·S_D(w) + (1-gcoh)·D(w)·D(w), where gcoh is the forgetting factor, taken here as 0.93.
  • S207 Update the state statistical histogram in the state statistical analyzer:
  • when the target return loss ERLE < T_erle (according to the ITU G.167 standard, -40 dB is recommended for T_erle here),
  • the state is initially set as the single-talk echo state,
  • the current C_XD is added to the state statistical histogram for updating, and the statistical count of the histogram is increased by one; otherwise, it is not added to the histogram statistics.
  • S208 Calculate the state coefficient: if the statistical count of the state statistical histogram is greater than T_N, take the median value T_XD of the histogram statistics and recalculate the state coefficient from it (the exact expression appears as an equation image in the original),
  • where alpha is a weight coefficient
  • with a value range of 0–1 (0.5 is recommended, and the value may be fine-tuned according to the environment and the reliability of the calculation factors),
  • and the range of the state coefficient is bounded accordingly. If the statistical count of the histogram is greater than T_N, the state coefficient is fed back to the residual echo suppressor and the dynamic range controller.
  • S209 Calculate the envelope of the output speech: the instantaneous envelope value EvenlopTemp of the current target speech signal is its frame energy average. When EvenlopTemp > T_noise, speech is considered present in the current state; since noise is generally distributed below -60 dB, T_noise is taken here as 0.000001. Under the condition that a speech signal is present, the envelope is updated if and only if one of two cases holds: the statistical count of the histogram is less than T_N, or the statistical count of the histogram is greater than T_N and C_XD < 0.75·T_XD (the value 0.75 in the inequality is chosen mainly because a value somewhat above the average is more reliable; it may of course also be chosen between 0 and 1). The update is Evenlop = factor·Evenlop + (1-factor)·EvenlopTemp, where factor is a forgetting coefficient taken as 0.5; otherwise, the envelope is not updated.
  • S210 The dynamic gain controller calculates the decibel value: the decibel value corresponding to the current frame of sound can be expressed as EvenlopdB(l) = 10·log10(Evenlop). The signal is then smoothed: if EvenlopdB(l) > EvenlopdB(l-1), the smoothed value is EvenlopdBSmooth = EvenlopdBSmooth·attact + (1-attact)·EvenlopdB(l), where attact is the attack time coefficient in the range 0–1 (for example 0.933); if EvenlopdB(l) < EvenlopdB(l-1), then EvenlopdBSmooth = EvenlopdBSmooth·release + (1-release)·EvenlopdB(l), where release is the release time coefficient in the range 0–1 (for example 0.966).
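  • A sketch of one dynamic-range-control update following the attack/release smoothing and the output-gain smoothing described in the text. The mapping from the smoothed level to the gain toward the target level is an assumption, since the exact gain function appears only as an equation image in the original.

```python
import numpy as np

ATTACK, RELEASE, TARGET_DB, BETA = 0.933, 0.966, -30.0, 0.93  # values suggested in the text

def drc_step(envelope, prev_db, smoothed_db, g_out):
    """One frame of dynamic range control: dB value, attack/release smoothing,
    gain toward the target level, and output-gain smoothing to avoid glitches."""
    level_db = 10.0 * np.log10(max(envelope, 1e-12))            # EvenlopdB(l)
    coeff = ATTACK if level_db > prev_db else RELEASE            # attack vs. release path
    smoothed_db = smoothed_db * coeff + (1 - coeff) * level_db   # EvenlopdBSmooth
    g = 10.0 ** ((TARGET_DB - smoothed_db) / 20.0)               # assumed gain mapping
    g_out = g_out * BETA + g * (1 - BETA)                        # smoothed output gain
    return level_db, smoothed_db, g_out
```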
  • The speech signal processing method provided by this application is thus an echo cancellation post-processing method based on statistical feedback, composed of a residual echo suppressor, a state statistical analyzer and a dynamic range controller.
  • The state statistical analyzer calculates the return loss ERLE in real time from the residual signal e(n) output by the residual echo suppressor and the near-end signal d(n), and analyzes in real time the average near-to-far-end coherence Cxd over 300 Hz–3000 Hz (the main frequency range of human speech, which helps reduce noise interference).
  • In different complex environments, because of reverberation, noise, nonlinear distortion and other factors, the near-to-far-end cross-coherence in a pure single-talk case is often no longer the theoretical value of 1. Therefore, when ERLE is smaller than the critical value T_erle, the scene is identified as single talk (matching the single-talk echo state), and the current Cxd is added to the state statistical histogram, so as to determine the approximate distribution of near-to-far-end coherence values under single talk in this environment.
  • When the count of the state statistical histogram exceeds T_N frames and the statistics are deemed stable, the coherence value Txd corresponding to the median position of the statistics is taken in real time.
  • From the mapping between the current average coherence Cxd and Txd, the likelihood of current echo is determined and the state coefficient is calculated; the state coefficient is fed back to the preceding residual echo suppressor as a basis for residual echo estimation, so as to avoid underestimating the residual echo and avoid echo leakage. In addition, the state coefficient is output to the dynamic range controller.
  • When the state coefficient value is greater than the threshold T_thre, envelope tracking is not carried out even if voice detection indicates speech; the instantaneous gain keeps the value of the previous state, which avoids echo amplification and sudden changes of the local sound volume.
  • the embodiments of the present application also provide a voice signal processing device.
  • the voice signal processing device described below and the voice signal processing method described above can be referred to each other.
  • the device includes the following modules:
  • the signal acquisition module 101 is used to acquire the state coefficient, the target voice signal to be processed, and the derivative signal of the target voice signal;
  • the target return loss acquisition module 102 is configured to use the derivative signal to calculate the initial return loss corresponding to each signal frame in the target voice signal, and use the state coefficient to adjust the initial return loss to obtain the target return loss;
  • the near-to-far-end coherence value calculation module 103 is configured to calculate the near-to-far-end coherence value of the target voice signal using the derived signals;
  • the judging module 104 is used to judge whether the target return loss matches the single talk echo state
  • the near-to-far coherence value recording module 105 is used to record the near-to-far coherence value in the state statistical data and update the statistical information when the target return loss matches the state of the single talk echo;
  • the volume control module 106 is used to update the state coefficients using statistical information, and to control the volume of the target voice signal when the state coefficients are updated.
  • Applying the apparatus provided by the embodiments of this application: a state coefficient, a target voice signal to be processed and derived signals of the target voice signal are acquired; the derived signals are used to calculate the initial return loss corresponding to each signal frame in the target voice signal, and the state coefficient is used to adjust the initial return loss to obtain the target return loss; the derived signals are used to calculate the near-to-far-end coherence value of the target voice signal; whether the target return loss matches the single-talk echo state is judged; if so, the near-to-far-end coherence value is recorded in the state statistical data and the statistical information is updated; the statistical information is used to update the state coefficient, and volume control is performed on the target voice signal when the state coefficient is updated.
  • After the state coefficient, the target voice signal to be processed and the derived signals of the target voice signal are obtained, the derived signals can be used to calculate the initial return loss corresponding to each signal frame in the target voice signal. The state coefficient is then used to adjust the initial return loss to obtain the target return loss.
  • This avoids the situation in single-talk echo where voice is wrongly deemed present and the envelope is wrongly tracked, which would make the volume gain adjustment rapidly increase or decrease and thereby enlarge the echo and make subsequent local voice suddenly louder or quieter, causing a series of low-quality sound problems.
  • After the target return loss is determined and the near-to-far-end coherence value is calculated, it can be judged whether the target return loss matches the single-talk echo state; if it matches, the coherence value is recorded in the state statistical data and the statistical information is updated. The statistical information is then used to update the state coefficient, and volume control is performed on the target voice when the state coefficient is updated.
  • Compared with the prior art, the apparatus has the advantages of high environmental applicability, strong echo suppression capability, and high sound quality.
  • Environmental adaptability is high: no matter how complex and changeable objective factors such as reverberation, noise and nonlinear distortion in the conference environment are, updating the state coefficient from the state statistical data is not limited by theoretical values and adaptively reflects the similarity of the near-end and far-end signals under single talk in the current environment. Sound quality is high: performing volume control on the target voice signal when the state coefficient is updated avoids wrong tracking of the loudness envelope and wrong gain adjustment in single-talk cases where the echo has not been removed cleanly (slight or large residual echo), which improves sound quality.
  • the device provided by the embodiment of the present application can improve the signal quality of the target voice signal after processing the target voice signal, can ensure the sound quality when the target voice signal is played, and further improve the user experience.
  • the derivative signal includes a residual signal, an estimated echo signal, a far-end reference signal, and a near-end signal.
  • Optionally, the target return loss acquisition module 102 is specifically configured to convert the derived signals into frequency domain signals and calculate the autocorrelation power spectra, wherein the frequency domain signals include the residual frequency domain signal, the estimated echo frequency domain signal, the far-end reference frequency domain signal and the near-end frequency domain signal, and the signal autocorrelation power spectra include the residual signal autocorrelation power spectrum, the estimated echo signal autocorrelation power spectrum and the far-end reference signal autocorrelation power spectrum; and to calculate the initial return loss by combining the frequency domain signals and the autocorrelation power spectra.
  • Optionally, the target return loss acquisition module 102 is specifically configured to perform unary linear regression analysis on the residual signal and the estimated echo signal to obtain the leakage coefficient of each signal frame in the target speech signal; to determine the residual echo estimate according to the preset correspondence between the leakage coefficient and the signal autocorrelation power spectrum, combined with the signal autocorrelation power spectrum; to suppress the residual echo using the residual echo suppression function to obtain the residual echo suppression signal; and to determine, from the residual echo suppression signal and the near-end signal, the initial return loss corresponding to each signal frame in the target voice signal.
  • the near-to-far coherence value recording module 105 is specifically configured to calculate the near-to-far coherence value by using a frequency domain signal within the frequency range of human speech.
  • the determining module 104 is specifically configured to determine whether the target return loss is less than a preset threshold; if it is, it is determined that the target return loss matches the single talk echo state; if not, it is determined The target return loss does not match the state of the single talk echo.
  • Optionally, the near-to-far-end coherence value recording module 105 is configured to store the state statistical data in the form of a statistical histogram; when the statistical count in the statistical information is greater than the preset statistical threshold, the median value of the histogram is obtained, the state coefficient is recalculated using the median value, and the state coefficient is updated with the calculation result.
  • the volume control module 106 is specifically configured to calculate the transient envelope value of the target voice signal when the state coefficient is updated; when the transient envelope value is greater than the preset noise threshold, perform Envelope update; use the updated envelope to calculate the decibel value to obtain the decibel calculation result; according to the mapping relationship between decibel and gain, determine the gain value that matches the decibel calculation result, and use the gain value to control the volume of the target voice signal.
  • the embodiments of the present application also provide a voice signal processing device.
  • the voice signal processing device described below and the voice signal processing method described above can be referred to each other.
  • the voice signal processing device includes:
  • the memory D1 is used to store computer programs
  • the processor D2 is configured to implement the steps of the voice signal processing method in the foregoing method embodiment when executing a computer program.
  • FIG. 6 is a schematic diagram of a specific structure of a voice signal processing device provided by this embodiment.
  • The voice signal processing device may differ considerably depending on configuration or performance, and may include one or more central processing units (CPU) 322 (for example, one or more processors), a memory 332, and one or more storage media 330 (for example, one or more mass storage devices) storing application programs 342 or data 344.
  • the memory 332 and the storage medium 330 may be short-term storage or persistent storage.
  • the program stored in the storage medium 330 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the data processing device.
  • the central processing unit 322 may be configured to communicate with the storage medium 330 and execute a series of instruction operations in the storage medium 330 on the voice signal processing device 301.
  • The voice signal processing device 301 may also include one or more power sources 326, one or more wired or wireless network interfaces 350, one or more input/output interfaces 358, and/or one or more operating systems 341, for example Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
  • the steps in the voice signal processing method described above can be implemented by the structure of the voice signal processing device.
  • the embodiments of the present application also provide a readable storage medium, and a readable storage medium described below and a voice signal processing method described above can be referred to each other.
  • a readable storage medium in which a computer program is stored, and when the computer program is executed by a processor, the steps of the voice signal processing method in the above method embodiment are implemented.
  • The readable storage medium may specifically be a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other readable storage medium capable of storing program code.


Abstract

A voice signal processing method is provided, including the following steps: acquiring a state coefficient, a target voice signal to be processed, and derived signals of the target voice signal (S101); using the derived signals to calculate the initial return loss corresponding to each signal frame in the target voice signal, and adjusting the initial return loss with the state coefficient to obtain the target return loss (S102); using the derived signals to calculate the near-to-far-end coherence value of the target voice signal (S103); judging whether the target return loss matches the single-talk echo state (S104); if so, recording the near-to-far-end coherence value in the state statistical data and updating the statistical information (S105); updating the state coefficient with the statistical information, and performing volume control on the target voice signal when the state coefficient is updated (S106). The method can ensure sound quality and further improve the user experience. A voice signal processing apparatus, a device, and a readable storage medium, which have corresponding technical effects, are also provided.

Description

Voice signal processing method, apparatus, device and readable storage medium
Technical Field
This application relates to the field of communication technology, and in particular to a voice signal processing method, apparatus, device and readable storage medium.
Background Art
In VOIP applications (such as software video conferencing and VOIP conference calls), echo cancellation plays a vital role, and its performance directly affects the quality of users' conference calls. In theory, the echo is formed by the target sound after passing through the impulse response of the room channel; therefore, the echo channel can be modeled by an adaptive filter to achieve echo cancellation. In actual conference-call scenarios, however, there are often problems such as limited convergence of the adaptive filter, nonlinear distortion of the loudspeaker, environmental background noise, and delay jitter.
Current echo cancellation technology generally cascades a residual echo suppressor after a linear echo canceller to achieve echo cancellation. The most common way to estimate the residual echo is to use the echo estimated by the adaptive filter together with a leakage factor, and then to suppress the residual echo on that basis. This method yields a good residual echo estimate only when the acoustic environment is good and the filter has converged; in real complex environments, the filter can hardly converge in a true sense, the estimated echo is often too small, the residual echo is underestimated, and slight echo leakage occurs. In addition, to ensure the quality of interactive sound in a conference, a dynamic range controller is often cascaded after the residual echo suppressor to keep the sound volume stable. A commonly used dynamic range controller judges the presence or absence of voice only from the sound amplitude, and on that basis tracks the sound envelope and adjusts the volume gain. In a single-talk case where the preceding echo cancellation is poor and echo leaks, the dynamic range controller still decides that voice is present and tracks the envelope incorrectly, making the volume gain adjustment rapidly increase or decrease, which enlarges the echo and makes subsequent normal local sound suddenly louder or quieter, causing a series of low-sound-quality problems.
In summary, how to effectively improve the quality of voice signal processing is a technical problem that those skilled in the art urgently need to solve.
Summary of the Invention
The purpose of this application is to provide a voice signal processing method, apparatus, device and readable storage medium, so as to improve the quality of voice signal processing.
To solve the above technical problem, this application provides the following technical solutions:
A voice signal processing method, including:
acquiring a state coefficient, a target voice signal to be processed, and derived signals of the target voice signal;
using the derived signals to calculate the initial return loss corresponding to each signal frame in the target voice signal, and adjusting the initial return loss with the state coefficient to obtain the target return loss;
using the derived signals to calculate the near-to-far-end coherence value of the target voice signal;
judging whether the target return loss matches the single-talk echo state;
if so, recording the near-to-far-end coherence value in the state statistical data and updating the statistical information;
updating the state coefficient with the statistical information, and performing volume control on the target voice signal when the state coefficient is updated.
Preferably, the derived signals include a residual signal, an estimated echo signal, a far-end reference signal and a near-end signal; using the derived signals to calculate the initial return loss corresponding to each signal frame in the target voice signal includes:
converting the derived signals into frequency domain signals and calculating autocorrelation power spectra, wherein the frequency domain signals include a residual frequency domain signal, an estimated echo frequency domain signal, a far-end reference frequency domain signal and a near-end frequency domain signal, and the signal autocorrelation power spectra include a residual signal autocorrelation power spectrum, an estimated echo signal autocorrelation power spectrum and a far-end reference signal autocorrelation power spectrum;
calculating the initial return loss by combining the frequency domain signals and the autocorrelation power spectra.
Preferably, calculating the initial return loss by using the frequency domain signals and the autocorrelation power spectra includes:
performing unary linear regression analysis on the residual signal and the estimated echo signal to obtain the leakage coefficient of each signal frame in the target voice signal;
determining the residual echo estimate according to the preset correspondence between the leakage coefficient and the signal autocorrelation power spectrum, combined with the signal autocorrelation power spectrum;
suppressing the residual echo using a residual echo suppression function to obtain a residual echo suppression signal;
determining, from the residual echo suppression signal and the near-end signal, the initial return loss corresponding to each signal frame in the target voice signal.
Preferably, using the derived signals to calculate the near-to-far-end coherence value of the target voice signal includes:
calculating the near-to-far-end coherence value using the frequency domain signals within the human voice frequency range.
Preferably, judging whether the target return loss matches the single-talk echo state includes:
judging whether the target return loss is less than a preset threshold;
if so, determining that the target return loss matches the single-talk echo state;
if not, determining that the target return loss does not match the single-talk echo state.
Preferably, when the state statistical data are stored in the form of a statistical histogram, updating the state coefficient with the statistical information includes:
obtaining the median value of the statistical histogram when the statistical count in the statistical information is greater than a preset statistical threshold;
recalculating the state coefficient using the median value, and updating the state coefficient with the calculation result.
Preferably, performing volume control on the target voice signal when the state coefficient is updated includes:
calculating the transient envelope value of the target voice signal when the state coefficient is updated;
updating the envelope when the transient envelope value is greater than a preset noise threshold;
calculating a decibel value using the updated envelope to obtain a decibel calculation result;
determining, according to the mapping relationship between decibels and gain, a gain value matching the decibel calculation result, and performing volume control on the target voice signal using the gain value.
A voice signal processing apparatus, including:
a signal acquisition module, configured to acquire a state coefficient, a target voice signal to be processed, and derived signals of the target voice signal;
a target return loss acquisition module, configured to calculate, using the derived signals, the initial return loss corresponding to each signal frame in the target voice signal, and to adjust the initial return loss with the state coefficient to obtain the target return loss;
a near-to-far-end coherence value calculation module, configured to calculate the near-to-far-end coherence value of the target voice signal using the derived signals;
a judgment module, configured to judge whether the target return loss matches the single-talk echo state;
a near-to-far-end coherence value recording module, configured to record the near-to-far-end coherence value in the state statistical data and update the statistical information when the target return loss matches the single-talk echo state;
a volume control module, configured to update the state coefficient with the statistical information and to perform volume control on the target voice signal when the state coefficient is updated.
A voice signal processing device, including:
a memory, configured to store a computer program;
a processor, configured to implement the steps of the above voice signal processing method when executing the computer program.
A readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above voice signal processing method.
Applying the method provided by the embodiments of this application: a state coefficient, a target voice signal to be processed and derived signals of the target voice signal are acquired; the derived signals are used to calculate the initial return loss corresponding to each signal frame in the target voice signal, and the state coefficient is used to adjust the initial return loss to obtain the target return loss; the derived signals are used to calculate the near-to-far-end coherence value of the target voice signal; whether the target return loss matches the single-talk echo state is judged; if so, the near-to-far-end coherence value is recorded in the state statistical data and the statistical information is updated; the statistical information is used to update the state coefficient, and volume control is performed on the target voice signal when the state coefficient is updated.
After the state coefficient, the target voice signal to be processed and the derived signals of the target voice signal are obtained, the derived signals can be used to calculate the initial return loss corresponding to each signal frame in the target voice signal. The state coefficient is then used to adjust the initial return loss to obtain the target return loss. This avoids the situation in single-talk echo where voice is wrongly deemed present and the envelope is wrongly tracked, which would make the volume gain adjustment rapidly increase or decrease and thereby enlarge the echo and make subsequent local voice suddenly louder or quieter, causing a series of low-quality sound problems. After the target return loss is determined and the near-to-far-end coherence value is calculated, whether the target return loss matches the single-talk echo state can be judged; if it matches, the value is recorded in the state statistical data and the statistical information is updated. The statistical information is then used to update the state coefficient, and volume control is performed on the target voice when the state coefficient is updated.
Compared with the prior art, this method has the advantages of high environmental applicability, strong echo suppression capability and high sound quality. Environmental adaptability is high: no matter how complex and changeable objective factors such as reverberation, noise and nonlinear distortion in the conference environment are, updating the state coefficient from the state statistical data is not limited by theoretical values and adaptively reflects the similarity of the near-end and far-end signals under single talk in the current environment. Sound quality is high: performing volume control on the target voice signal when the state coefficient is updated avoids wrong tracking of the loudness envelope and wrong gain adjustment in single-talk cases where the echo has not been removed cleanly (slight or large residual echo), which improves sound quality. In other words, after processing the target voice signal, the method provided by the embodiments of this application improves its signal quality, ensures the sound quality when the target voice signal is played, and can further improve the user experience.
Correspondingly, the embodiments of this application also provide a voice signal processing apparatus, device and readable storage medium corresponding to the above voice signal processing method, which have the above technical effects and are not repeated here.
Brief Description of the Drawings
In order to explain the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application; for a person of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is an implementation flowchart of a voice signal processing method in an embodiment of this application;
FIG. 2 is a technical block diagram of a voice signal processing method in an embodiment of this application;
FIG. 3 is a specific implementation flowchart of a voice signal processing method in an embodiment of this application;
FIG. 4 is a schematic structural diagram of a voice signal processing apparatus in an embodiment of this application;
FIG. 5 is a schematic structural diagram of a voice signal processing device in an embodiment of this application;
FIG. 6 is a schematic diagram of a specific structure of a voice signal processing device in an embodiment of this application.
Detailed Description of the Embodiments
To enable those skilled in the art to better understand the solutions of this application, this application is further described in detail below with reference to the drawings and specific embodiments. Obviously, the described embodiments are only some of the embodiments of this application, not all of them. Based on the embodiments in this application, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of this application.
Embodiment 1:
Please refer to FIG. 1, which is a flowchart of a voice signal processing method in an embodiment of this application. The method includes the following steps:
S101: Acquire a state coefficient, a target voice signal to be processed, and derived signals of the target voice signal.
It should be noted that, in the embodiments of this application, the target voice signal to be processed may be a signal obtained by performing echo cancellation on the original voice signal with an adaptive filter, and the derived signals of the target voice may include a residual signal, an estimated echo signal, a far-end reference signal and a near-end signal. For ease of description, e(n) herein denotes the residual signal, y(n) the estimated echo signal, x(n) the far-end reference signal and d(n) the near-end signal. The state coefficient is the coefficient characterizing the state of the target voice signal; it can be obtained from the preceding voice signal processing and is adjusted according to the specific signal state of the processed target voice signal. The adjustment process is detailed in the state-coefficient update process below.
S102: Use the derived signals to calculate the initial return loss corresponding to each signal frame in the target voice signal, and adjust the initial return loss with the state coefficient to obtain the target return loss.
After the state coefficient, the target voice signal and the derived signals of the target voice signal are obtained, the derived signals can be used to calculate the initial return loss corresponding to each signal frame in the target voice signal. To improve the accuracy of the return loss, the state coefficient can be used to adjust the initial return loss and obtain a more accurate target return loss.
The process of calculating, using the derived signals, the initial return loss corresponding to each signal frame in the target voice signal includes:
Step 1: Convert the derived signals into frequency domain signals and calculate the autocorrelation power spectra, wherein the frequency domain signals include the residual frequency domain signal, the estimated echo frequency domain signal, the far-end reference frequency domain signal and the near-end frequency domain signal, and the signal autocorrelation power spectra include the residual signal autocorrelation power spectrum, the estimated echo signal autocorrelation power spectrum and the far-end reference signal autocorrelation power spectrum;
Step 2: Calculate the initial return loss by combining the frequency domain signals and the autocorrelation power spectra.
For ease of description, the above two steps are explained together below.
Specifically, converting the time-domain derived signals into frequency domain signals can be realized by Fourier transform; the specific conversion follows the Fourier transform and is not repeated here. Since the derived signals include the residual signal, the estimated echo signal, the far-end reference signal and the near-end signal, the frequency domain signals obtained after conversion correspondingly include the residual frequency domain signal, the estimated echo frequency domain signal, the far-end reference frequency domain signal and the near-end frequency domain signal. For ease of description, E(w) herein denotes the residual frequency domain signal, Y(w) the estimated echo frequency domain signal, X(w) the far-end reference frequency domain signal and D(w) the near-end frequency domain signal. After the frequency domain signals are obtained, they can be used to calculate the autocorrelation power spectra of the constituent signals of the target voice signal; specifically, the residual signal autocorrelation power spectrum S_EE(w), the estimated echo autocorrelation power spectrum S_YY(w) and the far-end reference signal autocorrelation power spectrum S_XX(w) can be calculated.
The specific implementation of calculating the initial return loss by combining the frequency domain signals and the autocorrelation power spectra includes:
Step 1: Perform unary linear regression analysis on the residual signal and the estimated echo signal to obtain the leakage coefficient of each signal frame in the target voice signal;
Step 2: Determine the residual echo estimate according to the preset correspondence between the leakage coefficient and the signal autocorrelation power spectrum, combined with the signal autocorrelation power spectrum;
Step 3: Suppress the residual echo using the residual echo suppression function to obtain the residual echo suppression signal;
Step 4: Determine, from the residual echo suppression signal and the near-end signal, the initial return loss corresponding to each signal frame in the target voice signal.
For ease of description, the above four steps are explained together below.
The leakage coefficient Leak(l) of the residual echo of the current frame (the l-th frame) is calculated; this value reflects the degree of echo leakage relative to the far-end signal and is obtained by unary linear regression analysis of the residual signal e(n) and the estimated echo y(n), which can be expressed as the ratio Leak(l) = Σ_w R_EY(l,w) / Σ_w R_YY(l,w),
where R_EY(l,w) is the frequency-domain cross-coherence value obtained by recursive averaging of the residual signal and the estimated echo signal, and R_YY(l,w) is the autocorrelation value obtained by recursive averaging of the estimated echo signal; their expressions are
R_EY(l,w) = (1-β)·R_EY(l-1,w) + β·E(w)·Y(w) and R_YY(l,w) = (1-β)·R_YY(l-1,w) + β·S_YY(w), where R_EY(0,w) = 1, R_YY(0,w) = 1, and β is the smoothing coefficient (β lies in the interval 0–1; for example, 0.93 may be taken).
Then, according to the preset correspondence between the leakage coefficient and the signal autocorrelation power spectrum, the residual echo estimate REcho(w) is calculated: when the leakage factor Leak is greater than 0.5, REcho(w) = S_YY(w); otherwise, REcho(w) = 2·S_YY(w). Preferably, when the system tends to be stable, the state coefficient can be used for fine tuning, further giving the residual echo estimate REcho(w) = REcho(w)·Ralpha(w), where Ralpha(w) is the state coefficient.
The residual echo suppression function is a suppression gain computed from S_EE(w), S_YY(w) and REcho(w) (its exact expression appears as an equation image in the original), where S_EE(w) is the residual signal autocorrelation power spectrum, S_YY(w) is the estimated echo signal autocorrelation power spectrum, and REcho(w) is the residual echo estimate. Echo suppression is performed using the residual echo suppression function, so as to obtain the signal e2(n) output by the residual echo suppressor.
The target return loss ERLE is then calculated from e2(n) and the near-end signal d(n) over the frame (the expression appears as an equation image in the original), where N is the frame length.
It should be noted that the above way of calculating the return loss is only one optional implementation listed in this application; a person skilled in the art may well calculate it in a related or similar manner, for example by modifying or adding weight coefficients of the relevant parameters, adding an error offset, and so on. The same applies to the calculations of the coherence value, the transient envelope value, the power and the like described in the embodiments of this application; they are merely optional calculation methods and are not unduly limiting.
S103: Use the derived signals to calculate the near-to-far-end coherence value of the target voice signal.
Specifically, after the far-end reference signal and the near-end signal among the derived signals are converted into frequency domain signals, the far-end reference frequency domain signal and the near-end frequency domain signal can be used to calculate the near-to-far-end coherence value of the target voice signal.
Preferably, since the main sound source of the target voice signal processed in the embodiments of this application is the human voice, captured by a sound sensor, the near-to-far-end coherence value can be calculated from the frequency domain signals within the human voice frequency range in order to reduce interference from other noise. The human voice frequency range is 300 Hz–3000 Hz; that is, when calculating the near-to-far-end coherence value, the frequency search range is 300 Hz–3000 Hz.
S104: Judge whether the target return loss matches the single-talk echo state.
In the embodiments of this application, a judgment threshold for deciding whether the target return loss matches the single-talk echo state can be preset, namely the preset threshold T_erle (according to the ITU G.167 standard, -40 dB is recommended for T_erle here). The specific judgment process includes:
Step 1: Judge whether the target return loss is less than the preset threshold;
Step 2: If yes, determine that the target return loss matches the single-talk echo state;
Step 3: If no, determine that the target return loss does not match the single-talk echo state.
For ease of description, the above three steps are explained together below.
When the return loss ERLE < T_erle, the state is initially set as the single-talk echo state, and the operation of step S105 is executed; otherwise, no processing is required and no operation is performed.
S105: Record the near-to-far-end coherence value in the state statistical data, and update the statistical information.
After the near-to-far-end coherence value is calculated, and when the target return loss matches the single-talk echo state, the calculated near-to-far-end coherence value is recorded in the state statistical data. For ease of statistics, the state statistical data can be stored in the form of a state statistical histogram. Of course, in other embodiments of this application, the state statistical data can also be stored in common data statistical forms such as tables or sequences. The statistical information may include the statistical count and the statistical time of the recorded near-to-far-end coherence values.
S106: Update the state coefficient with the statistical information, and perform volume control on the target voice signal when the state coefficient is updated.
Since the statistical information is a further statistical result of the state statistical data recording the near-to-far-end coherence values under the single-talk echo state, the state of the target voice signal can be known from it, and the statistical information can therefore be used to update the state coefficient. Specifically, when the state statistical data are stored as a statistical histogram, the median value of the histogram can be obtained once the statistical count in the statistical information is greater than the preset statistical threshold; the state coefficient is then recalculated from this median value and updated with the calculation result. Calculating the state coefficient from the median value, without being limited by a theoretical value, better reflects the state of the target voice signal.
Specifically, when the state coefficient is updated, the transient envelope value of the target voice signal is calculated; when the transient envelope value is greater than the preset noise threshold, the envelope is updated; the updated envelope is used to calculate a decibel value to obtain a decibel calculation result; according to the mapping relationship between decibels and gain, a gain value matching the decibel calculation result is determined, and the gain value is used to control the volume of the target voice signal. The transient envelope value is calculated as the average energy of the target voice signal. The envelope is the curve reflecting the change of the signal amplitude, and the value corresponding to each point on the envelope can be regarded as a transient envelope value. After the transient envelope value is obtained, whether to update the envelope can be decided by judging whether it is greater than the preset noise threshold; that is, when the transient envelope value is greater than the preset threshold, speech is deemed present and the envelope can be updated. The decibel calculation is then performed on the updated envelope, and the gain value is finally determined from the decibel value. In this way, gain control can be performed on the target voice signal, i.e. the volume of the voice signal can be controlled.
Applying the method provided by the embodiments of this application: a state coefficient, a target voice signal to be processed and derived signals of the target voice signal are acquired; the derived signals are used to calculate the initial return loss corresponding to each signal frame in the target voice signal, and the state coefficient is used to adjust the initial return loss to obtain the target return loss; the derived signals are used to calculate the near-to-far-end coherence value of the target voice signal; whether the target return loss matches the single-talk echo state is judged; if so, the near-to-far-end coherence value is recorded in the state statistical data and the statistical information is updated; the statistical information is used to update the state coefficient, and volume control is performed on the target voice signal when the state coefficient is updated.
After the state coefficient, the target voice signal to be processed and the derived signals of the target voice signal are obtained, the derived signals can be used to calculate the initial return loss corresponding to each signal frame in the target voice signal. The state coefficient is then used to adjust the initial return loss to obtain the target return loss. This avoids the situation in single-talk echo where voice is wrongly deemed present and the envelope is wrongly tracked, which would make the volume gain adjustment rapidly increase or decrease and thereby enlarge the echo and make subsequent local voice suddenly louder or quieter, causing a series of low-quality sound problems. After the target return loss is determined and the near-to-far-end coherence value is calculated, whether the target return loss matches the single-talk echo state can be judged; if it matches, the value is recorded in the state statistical data and the statistical information is updated. The statistical information is then used to update the state coefficient, and volume control is performed on the target voice when the state coefficient is updated.
Compared with the prior art, this method has the advantages of high environmental applicability, strong echo suppression capability and high sound quality. Environmental adaptability is high: no matter how complex and changeable objective factors such as reverberation, noise and nonlinear distortion in the conference environment are, updating the state coefficient from the state statistical data is not limited by theoretical values and adaptively reflects the similarity of the near-end and far-end signals under single talk in the current environment. Sound quality is high: performing volume control on the target voice signal when the state coefficient is updated avoids wrong tracking of the loudness envelope and wrong gain adjustment in single-talk cases where the echo has not been removed cleanly (slight or large residual echo), which improves sound quality. In other words, after processing the target voice signal, the method provided by the embodiments of this application improves its signal quality, ensures the sound quality when the target voice signal is played, and can further improve the user experience.
实施例二:
为了便于本领域技术人员更好地理解本申请实施例所提供的语音信号处理方法,下面将上述步骤所实现的功能模拟为相应器件后,对本申请实施例所提供的语音信号处理方法,进行详细说明。
请参考图2,图2为本申请实施例中一种语音信号处理方法的技术框图。可见,本申请所提供在语音信号处理方法,可由残留回声抑制器、状态统计分析器、动态范围控制器三部分组成,其中,状态系数作为状态统计分析器的输出,反馈调节残留回声抑制器中回声的估计,同时又为动态范围控制器的响度跟踪作为一种判断依据。
请参考图3,图3为本申请实施例中一种语音信号处理方法的具体实施流程图。该方法的具体实现过程,包括:
S201:信号预处理:残留回声抑制器获得自适应滤波器的残差信号e(n)、估计回声信号y(n)、远端参考信号x(n)和近端信号d(n),并分别进行傅里叶变换,得到残差信号频域信号e(w)、估计回声频域信号Y(w)、远端参考频域信号X(w)和近端频域信号D(w)。同时,分别计算残差信号自相关功率谱S EE(w)、估计回声信号自相关功率谱S YY(w)和远端参考信号自相关功率谱S XX(w);
S202, compute the leakage coefficient: compute the leakage coefficient Leak(l) of the residual echo for the current frame (frame l). This value reflects the degree to which the echo leaks relative to the far-end signal and is obtained by a univariate linear regression analysis of the residual signal e(n) and the estimated echo y(n); it is expressed as
Leak(l) = Σ_w R_EY(l,w) / Σ_w R_YY(l,w)    (formula image PCTCN2019111319-appb-000004)
where R_EY(l,w) is the frequency-domain cross-coherence value obtained by recursive averaging of the residual signal and the estimated echo, and R_YY(l,w) is the autocorrelation value obtained by recursive averaging of the estimated echo; their expressions are, respectively:
R_EY(l,w) = (1-β)*R_EY(l-1,w) + β*E(w)*Y(w), R_YY(l,w) = (1-β)*R_YY(l-1,w) + β*S_YY(w), where R_EY(0,w) = 1, R_YY(0,w) = 1, and β is a smoothing coefficient, taken as 0.93.
S203, compute the estimated residual echo: that is, compute the residual echo estimate REcho(w). When the leakage factor Leak is greater than 0.5, the residual echo estimate is REcho(w) = S_YY(w); otherwise REcho(w) = 2*S_YY(w). Here, if the count of the statistical histogram exceeds T_N frames and the system tends to be stable, the state coefficient fed back by the state statistics analyzer is used for fine-tuning, further obtaining the residual echo estimate REcho(w) = REcho(w)*Ralpha(w), where Ralpha(w) is the state coefficient fed back by the state statistics analyzer;
S204, suppress the residual echo: echo suppression is performed with the residual echo suppression function
[Formula image PCTCN2019111319-appb-000005: the residual echo suppression function, expressed in terms of S_EE(w) and REcho(w)]
thereby obtaining the signal e2(n) output by the residual echo suppressor;
S205, compute the target echo return loss: using
ERLE = 10*log10( Σ_{n=1..N} e2(n)^2 / Σ_{n=1..N} d(n)^2 )    (formula image PCTCN2019111319-appb-000006)
compute the target echo return loss, where N is the frame length;
S206, compute the distribution of the far-end/near-end coherence values: the average far-end/near-end coherence is
C_XD = (1/(endbin - startbin + 1)) * Σ_{w=startbin..endbin} |S_XD(w)|^2 / (S_X(w)*S_D(w))    (formula image PCTCN2019111319-appb-000007)
where startbin and endbin correspond to the frequency search range of 300 Hz to 3000 Hz, which is the main frequency range of a speaker's voice. S_XD(w), S_X(w) and S_D(w) are computed as follows:
S_XD(w) = gcoh*S_XD(w) + (1-gcoh)*X(w)*D(w);
S_X(w) = gcoh*S_X(w) + (1-gcoh)*X(w)*X(w);
S_D(w) = gcoh*S_D(w) + (1-gcoh)*D(w)*D(w), where gcoh is a forgetting factor, here taken as 0.93.
S207, update the state statistical histogram in the state statistics analyzer: when the target echo return loss ERLE < T_erle (according to the ITU G.167 standard, a value of -40 dB is recommended for T_erle here), the state is preliminarily determined to be the single-talk echo state, and the current C_XD is added to the state statistical histogram to update it; at the same time, the count of the state statistical histogram is incremented by one. Otherwise the value is not added to the histogram statistics.
S208, compute the state coefficient: if the count of the state statistical histogram is greater than T_N, take the median value T_XD of the histogram statistics. On this basis, the state coefficient is computed as
[Formula image PCTCN2019111319-appb-000008: the state coefficient Ralpha(w), expressed in terms of C_XD, T_XD and the weight coefficient alpha]
where alpha is a weight coefficient with a value range of 0 to 1; 0.5 is recommended, and this value may also be fine-tuned according to the environment and the reliability of the computed factors. At the same time, the value of the state coefficient is constrained to the range given by
[Formula image PCTCN2019111319-appb-000009: the lower and upper bounds of the state coefficient]
If the count of the histogram is greater than T_N, the state coefficient is fed back to the residual echo suppressor and the dynamic range controller.
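Because the exact expressions behind the two formula images in S208 are not reproduced here, the sketch below uses one plausible form, a weighted ratio of the current coherence C_XD to the histogram median T_XD clipped to a fixed range, purely to illustrate how the state coefficient could be derived and fed back; the weighted-ratio form and the clamp bounds are assumptions of the sketch.

    def state_coefficient(c_xd, t_xd, alpha=0.5, lo=0.5, hi=2.0):
        """Illustrative state coefficient from the current coherence C_XD and the
        histogram median T_XD.  The weighted-ratio form and the [lo, hi] clamp
        stand in for the formula images in the application."""
        ratio = c_xd / max(t_xd, 1e-6)
        ralpha = alpha * ratio + (1.0 - alpha)     # weight between the ratio and 1
        return min(max(ralpha, lo), hi)            # keep the coefficient in a bounded range

    # Once the histogram count exceeds T_N, the value would be fed back:
    #   - to the residual echo suppressor, scaling REcho(w) (see S203), and
    #   - to the dynamic range controller, gating the envelope tracking (see S209).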
S209, compute the envelope of the output voice: the transient envelope value of the current target voice signal is
EvenlopTemp = (1/N) * Σ_{n=1..N} e2(n)^2    (formula image PCTCN2019111319-appb-000010)
When EvenlopTemp > T_noise, the current state is considered to contain speech; considering that noise is generally distributed below -60 dB, T_noise is taken here as 0.000001. Under the condition that a voice signal is present, the envelope is updated if and only if one of the following two cases holds: the count of the histogram is less than T_N; or the count of the histogram is greater than T_N and C_XD < 0.75*T_XD (the value 0.75 in the inequality is chosen mainly because values somewhat above the average are more reliable; of course, this value may also be chosen between 0 and 1). The envelope update is:
Evenlop = factor*Evenlop + (1-factor)*EvenlopTemp, where factor is a forgetting coefficient, taken as 0.5; otherwise the envelope is not updated.
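The envelope-update gating of S209 follows directly from the two cases above; a minimal sketch, with T_N as an assumed histogram-count threshold:

    def update_envelope(envelope, env_tmp, hist_count, c_xd, t_xd,
                        T_N=200, T_noise=1e-6, factor=0.5):
        """Update the loudness envelope only when speech is present and the
        single-talk statistics do not indicate a strongly echo-dominated frame."""
        speech_present = env_tmp > T_noise
        warming_up = hist_count < T_N                         # statistics not yet stable
        low_coherence = hist_count > T_N and c_xd < 0.75 * t_xd
        if speech_present and (warming_up or low_coherence):
            envelope = factor * envelope + (1 - factor) * env_tmp
        return envelope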
S210, the dynamic gain controller computes the decibel value: the decibel value corresponding to the sound of the current frame can be expressed as:
EvenlopdB(l) = 10*log10(Evenlop). The signal is then smoothed: if EvenlopdB(l) > EvenlopdB(l-1), the secondary smoothed value EvenlopdBSmooth of the signal can be expressed as EvenlopdBSmooth = EvenlopdBSmooth*attack + (1-attack)*EvenlopdB(l), where attack is an attack-time coefficient with a value range between 0 and 1, for example 0.933; if EvenlopdB(l) < EvenlopdB(l-1), the secondary smoothed value EvenlopdBSmooth can be expressed as EvenlopdBSmooth = EvenlopdBSmooth*release + (1-release)*EvenlopdB(l), where release is a release-time coefficient with a value range between 0 and 1, for example 0.966.
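The two-branch smoothing of S210 maps directly to code; a minimal sketch with the per-frame state passed in explicitly:

    import numpy as np

    def smooth_level_db(envelope, prev_db, prev_smooth,
                        attack=0.933, release=0.966):
        """Attack/release smoothing of the frame level in dB.

        envelope    : updated loudness envelope of the current frame
        prev_db     : EvenlopdB(l-1)
        prev_smooth : EvenlopdBSmooth carried over from the previous frame
        """
        cur_db = 10.0 * np.log10(envelope + 1e-12)            # EvenlopdB(l)
        if cur_db > prev_db:                                  # level rising -> attack branch
            smooth = prev_smooth * attack + (1 - attack) * cur_db
        else:                                                 # level falling (or equal) -> release branch
            smooth = prev_smooth * release + (1 - release) * cur_db
        return cur_db, smooth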
S211, the dynamic gain controller computes the gain value: the gain function of the dynamic gain control can be expressed as:
[Formula image PCTCN2019111319-appb-000011: the gain function g, expressed in terms of the smoothed level EvenlopdBSmooth and the target level G_Target]
where G_Target is the target decibel value; based on the normal volume range of a speaker, -30 dB is recommended here. To avoid glitches in the gain value, the gain value is smoothed as g_out = g_out*β + g*(1-β), where β takes a value between 0 and 1, for example 0.93; the output is then produced after gain adjustment.
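Since the gain function itself is only given as a formula image, the sketch below assumes the simplest mapping, a linear gain of 10^((G_Target - EvenlopdBSmooth)/20) that drives the smoothed level toward G_Target, followed by the gain smoothing described in the text; the mapping is an assumption of the sketch.

    def apply_gain(frame, smooth_db, g_out_prev, g_target_db=-30.0, beta=0.93):
        """Gain computation and smoothing for one frame.

        The 10**((G_Target - EvenlopdBSmooth)/20) mapping is an assumed form of
        the gain function; the g_out smoothing follows the expression in the text.
        """
        g = 10.0 ** ((g_target_db - smooth_db) / 20.0)   # raw gain toward the target level
        g_out = g_out_prev * beta + g * (1.0 - beta)     # smooth to avoid glitches
        return [g_out * s for s in frame], g_out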
It can be seen that the voice signal processing method provided in the present application is an echo cancellation post-processing method based on statistical feedback, consisting of three parts: a residual echo suppressor, a state statistics analyzer and a dynamic range controller. The state statistics analyzer computes the echo return loss ERLE in real time from the residual signal e(n) output by the residual echo suppressor and the near-end signal d(n), and analyzes in real time the average value Cxd of the near-end/far-end cross-coherence between 300 Hz and 3000 Hz (the main frequency range of human speech, which helps reduce interference from noise). In different complex environments, owing to factors such as reverberation, noise and nonlinear distortion, the near-end/far-end cross-coherence value in the pure single-talk case is often no longer the theoretical value of 1. Therefore, when ERLE is less than the critical value T_erle, the scenario is deemed to be single-talk (matching the single-talk echo state), and the current Cxd is added to the state statistical histogram, so as to determine the approximate distribution of the near-end/far-end coherence values under single-talk in that environment.
When the count of the state statistical histogram exceeds T_N frames and the system is deemed to have stabilized, the coherence value Txd corresponding to the median position of the statistics is taken in real time. From the mapping relationship between the current average coherence value Cxd and Txd at the current moment, the likelihood of echo at present is determined, and the state coefficient is computed accordingly. The state coefficient is fed back to the preceding residual echo suppressor as a basis for the residual echo estimation, thereby avoiding underestimation of the residual echo and preventing echo leakage. In addition, the state coefficient is output to the dynamic range controller: when the state coefficient is greater than the threshold T_thre, the envelope is not tracked even if speech is detected, since the echo contained in the speech at that moment is considered to affect the envelope tracking significantly; the instantaneous gain value keeps the value of the previous state, thereby avoiding interactive sound-quality problems such as echo amplification and the local voice suddenly becoming louder or quieter.
Embodiment 3:
Corresponding to the above method embodiments, an embodiment of the present application further provides a voice signal processing apparatus. The voice signal processing apparatus described below and the voice signal processing method described above may be referred to in correspondence with each other.
As shown in FIG. 4, the apparatus includes the following modules:
a signal obtaining module 101, configured to obtain a state coefficient, a target voice signal to be processed and derived signals of the target voice signal;
a target echo return loss obtaining module 102, configured to use the derived signals to compute the initial echo return loss corresponding to each signal frame in the target voice signal, and to use the state coefficient to adjust the initial echo return loss to obtain a target echo return loss;
a near-end/far-end coherence value computing module 103, configured to use the derived signals to compute the near-end/far-end coherence value of the target voice signal;
a judging module 104, configured to judge whether the target echo return loss matches the single-talk echo state;
a near-end/far-end coherence value recording module 105, configured to record the near-end/far-end coherence value in the state statistics data and update the statistical information when the target echo return loss matches the single-talk echo state;
a volume control module 106, configured to use the statistical information to update the state coefficient, and to perform volume control on the target voice signal when the state coefficient is updated.
By applying the apparatus provided in the embodiments of the present application: a state coefficient, a target voice signal to be processed and derived signals of the target voice signal are obtained; the derived signals are used to compute the initial echo return loss corresponding to each signal frame in the target voice signal, and the state coefficient is used to adjust the initial echo return loss to obtain the target echo return loss; the derived signals are used to compute the near-end/far-end coherence value of the target voice signal; whether the target echo return loss matches the single-talk echo state is determined; if yes, the near-end/far-end coherence value is recorded in the state statistics data and the statistical information is updated; the statistical information is used to update the state coefficient, and volume control is performed on the target voice signal when the state coefficient is updated.
After the state coefficient, the target voice signal to be processed and the derived signals of the target voice signal are obtained, the derived signals may be used to compute the initial echo return loss corresponding to each signal frame in the target voice signal. Then the state coefficient is used to adjust the initial echo return loss to obtain the target echo return loss. This avoids, in the single-talk echo case, wrongly deciding that speech is present and wrongly tracking the envelope, which would cause the volume gain adjustment to rise or fall rapidly and in turn cause a series of sound-quality problems such as the echo becoming louder and the subsequent local voice signal fluctuating in loudness. After the target echo return loss is determined and the near-end/far-end coherence value is computed, whether the target echo return loss matches the single-talk echo state can be judged; if it matches, the coherence value is recorded in the state statistics data and the statistical information is updated. The statistical information is then used to update the state coefficient, and volume control is performed on the target voice when the state coefficient is updated.
Compared with the prior art, this apparatus has the advantages of high environmental adaptability, strong echo suppression capability and high sound quality. High environmental adaptability means that no matter how complex and variable the objective factors such as reverberation, noise and nonlinear distortion in the conference environment are, updating the state coefficient with the state statistics data is not constrained by theoretical values and adaptively reflects the similarity between the near-end and far-end signals in the single-talk case of the current environment. High sound quality means that performing volume control on the target voice signal when the state coefficient is updated avoids erroneous tracking of the loudness envelope and erroneous gain adjustment in the single-talk case when the echo has not been completely cancelled (whether the residual echo is slight or large), which improves sound quality. That is, after the target voice signal is processed by the apparatus provided in the embodiments of the present application, the signal quality of the target voice signal can be improved, the sound quality can be guaranteed when the target voice signal is played, and the user experience can be further improved.
In a specific implementation of the present application, the derived signals include a residual signal, an estimated echo signal, a far-end reference signal and a near-end signal. Correspondingly, the target echo return loss obtaining module 102 is specifically configured to convert the derived signals into frequency-domain signals and compute autocorrelation power spectra, the frequency-domain signals including a residual frequency-domain signal, an estimated echo frequency-domain signal, a far-end reference frequency-domain signal and a near-end frequency-domain signal, and the signal autocorrelation power spectra including an autocorrelation power spectrum of the residual signal, an autocorrelation power spectrum of the estimated echo signal and an autocorrelation power spectrum of the far-end reference signal; and to compute the initial echo return loss by combining the frequency-domain signals and the autocorrelation power spectra.
In a specific implementation of the present application, the target echo return loss obtaining module 102 is specifically configured to perform a univariate linear regression analysis on the residual signal and the estimated echo signal to obtain the leakage coefficient of each signal frame in the target voice signal; to determine the residual echo estimate according to the preset correspondence between the leakage coefficient and the signal autocorrelation power spectra, in combination with the signal autocorrelation power spectra; to suppress the residual echo with the residual echo suppression function to obtain a residual echo suppression signal; and to use the residual echo suppression signal and the near-end signal to determine the initial echo return loss corresponding to each signal frame in the target voice signal.
In a specific implementation of the present application, the near-end/far-end coherence value recording module 105 is specifically configured to compute the near-end/far-end coherence value from the frequency-domain signals within the human speech frequency range.
In a specific implementation of the present application, the judging module 104 is specifically configured to judge whether the target echo return loss is less than a preset threshold; if yes, to determine that the target echo return loss matches the single-talk echo state; if no, to determine that the target echo return loss does not match the single-talk echo state.
In a specific implementation of the present application, the near-end/far-end coherence value recording module 105 is configured to store the state statistics data in the form of a statistical histogram, and to obtain the median value of the statistical histogram when the count in the statistical information is greater than a preset statistical threshold; and to recompute the state coefficient from the median value and update the state coefficient with the computation result.
In a specific implementation of the present application, the volume control module 106 is specifically configured to compute the transient envelope value of the target voice signal when the state coefficient is updated; to update the envelope when the transient envelope value is greater than a preset noise threshold; to compute the decibel value from the updated envelope to obtain a decibel computation result; and to determine, according to the mapping relationship between decibels and gain, a gain value matching the decibel computation result, and to perform volume control on the target voice signal with the gain value.
Embodiment 4:
Corresponding to the above method embodiments, an embodiment of the present application further provides a voice signal processing device. The voice signal processing device described below and the voice signal processing method described above may be referred to in correspondence with each other.
As shown in FIG. 5, the voice signal processing device includes:
a memory D1, configured to store a computer program;
a processor D2, configured to implement the steps of the voice signal processing method of the above method embodiments when executing the computer program.
Specifically, please refer to FIG. 6, which is a schematic diagram of a specific structure of a voice signal processing device provided in this embodiment. The voice signal processing device may vary considerably depending on configuration or performance, and may include one or more processors (central processing units, CPU) 322 (for example, one or more processors), a memory 332, and one or more storage media 330 (for example, one or more mass storage devices) storing application programs 342 or data 344. The memory 332 and the storage medium 330 may be transient storage or persistent storage. The program stored in the storage medium 330 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the data processing device. Further, the central processing unit 322 may be configured to communicate with the storage medium 330 and to execute, on the voice signal processing device 301, the series of instruction operations in the storage medium 330.
The voice signal processing device 301 may further include one or more power supplies 326, one or more wired or wireless network interfaces 350, one or more input/output interfaces 358, and/or one or more operating systems 341, for example Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM and so on.
The steps of the voice signal processing method described above may be implemented by the structure of the voice signal processing device.
Embodiment 5:
Corresponding to the above method embodiments, an embodiment of the present application further provides a readable storage medium. The readable storage medium described below and the voice signal processing method described above may be referred to in correspondence with each other.
A readable storage medium stores a computer program which, when executed by a processor, implements the steps of the voice signal processing method of the above method embodiments.
The readable storage medium may specifically be a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc or any other readable storage medium capable of storing program code.
Those skilled in the art may further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented by electronic hardware, computer software or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of their functions. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each specific application, but such implementations should not be considered to go beyond the scope of the present application.

Claims (10)

  1. A voice signal processing method, characterized by comprising:
    obtaining a state coefficient, a target voice signal to be processed and derived signals of the target voice signal;
    using the derived signals to compute an initial echo return loss corresponding to each signal frame in the target voice signal, and using the state coefficient to adjust the initial echo return loss to obtain a target echo return loss;
    using the derived signals to compute a near-end/far-end coherence value of the target voice signal;
    judging whether the target echo return loss matches a single-talk echo state;
    if yes, recording the near-end/far-end coherence value in state statistics data, and updating statistical information;
    using the statistical information to update the state coefficient, and performing volume control on the target voice signal when the state coefficient is updated.
  2. The voice signal processing method according to claim 1, characterized in that the derived signals comprise a residual signal, an estimated echo signal, a far-end reference signal and a near-end signal, and using the derived signals to compute the initial echo return loss corresponding to each signal frame in the target voice signal comprises:
    converting the derived signals into frequency-domain signals, and computing autocorrelation power spectra, wherein the frequency-domain signals comprise a residual frequency-domain signal, an estimated echo frequency-domain signal, a far-end reference frequency-domain signal and a near-end frequency-domain signal, and the signal autocorrelation power spectra comprise an autocorrelation power spectrum of the residual signal, an autocorrelation power spectrum of the estimated echo signal and an autocorrelation power spectrum of the far-end reference signal;
    computing the initial echo return loss by combining the frequency-domain signals and the autocorrelation power spectra.
  3. The voice signal processing method according to claim 2, characterized in that computing the initial echo return loss by using the frequency-domain signals and the autocorrelation power spectra comprises:
    performing a univariate linear regression analysis on the residual signal and the estimated echo signal to obtain a leakage coefficient of each signal frame in the target voice signal;
    determining a residual echo estimate according to a preset correspondence between the leakage coefficient and the signal autocorrelation power spectra, in combination with the signal autocorrelation power spectra;
    suppressing the residual echo with a residual echo suppression function to obtain a residual echo suppression signal;
    using the residual echo suppression signal and the near-end signal to determine the initial echo return loss corresponding to each signal frame in the target voice signal.
  4. The voice signal processing method according to claim 2, characterized in that using the derived signals to compute the near-end/far-end coherence value of the target voice signal comprises:
    computing the near-end/far-end coherence value from the frequency-domain signals within the human speech frequency range.
  5. The voice signal processing method according to claim 1, characterized in that judging whether the target echo return loss matches the single-talk echo state comprises:
    judging whether the target echo return loss is less than a preset threshold;
    if yes, determining that the target echo return loss matches the single-talk echo state;
    if no, determining that the target echo return loss does not match the single-talk echo state.
  6. The voice signal processing method according to claim 1, characterized in that, when the state statistics data are stored in the form of a statistical histogram, using the statistical information to update the state coefficient comprises:
    obtaining a median value of the statistical histogram when a count in the statistical information is greater than a preset statistical threshold;
    recomputing the state coefficient from the median value, and updating the state coefficient with the computation result.
  7. The voice signal processing method according to claim 1, characterized in that performing volume control on the target voice signal when the state coefficient is updated comprises:
    computing a transient envelope value of the target voice signal when the state coefficient is updated;
    updating the envelope when the transient envelope value is greater than a preset noise threshold;
    computing a decibel value from the updated envelope to obtain a decibel computation result;
    determining, according to a mapping relationship between decibels and gain, a gain value matching the decibel computation result, and performing volume control on the target voice signal with the gain value.
  8. A voice signal processing apparatus, characterized by comprising:
    a signal obtaining module, configured to obtain a state coefficient, a target voice signal to be processed and derived signals of the target voice signal;
    a target echo return loss obtaining module, configured to use the derived signals to compute an initial echo return loss corresponding to each signal frame in the target voice signal, and to use the state coefficient to adjust the initial echo return loss to obtain a target echo return loss;
    a near-end/far-end coherence value computing module, configured to use the derived signals to compute a near-end/far-end coherence value of the target voice signal;
    a judging module, configured to judge whether the target echo return loss matches a single-talk echo state;
    a near-end/far-end coherence value recording module, configured to record the near-end/far-end coherence value in state statistics data and update statistical information when the target echo return loss matches the single-talk echo state;
    a volume control module, configured to use the statistical information to update the state coefficient, and to perform volume control on the target voice signal when the state coefficient is updated.
  9. A voice signal processing device, characterized by comprising:
    a memory, configured to store a computer program;
    a processor, configured to implement the steps of the voice signal processing method according to any one of claims 1 to 7 when executing the computer program.
  10. A readable storage medium, characterized in that the readable storage medium stores a computer program which, when executed by a processor, implements the steps of the voice signal processing method according to any one of claims 1 to 7.
PCT/CN2019/111319 2019-03-14 2019-10-15 一种语音信号处理方法、装置、设备及可读存储介质 WO2020181766A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US17/438,640 US11869528B2 (en) 2019-03-14 2019-10-15 Voice signal processing method and device, apparatus, and readable storage medium
EP19918901.0A EP3929919A4 (en) 2019-03-14 2019-10-15 VOICE SIGNAL PROCESSING METHOD AND DEVICE, APPARATUS AND READABLE STORAGE MEDIA

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910195602.X 2019-03-14
CN201910195602.XA CN109767780B (zh) 2019-03-14 2019-03-14 一种语音信号处理方法、装置、设备及可读存储介质

Publications (1)

Publication Number Publication Date
WO2020181766A1 true WO2020181766A1 (zh) 2020-09-17

Family

ID=66458366

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/111319 WO2020181766A1 (zh) 2019-03-14 2019-10-15 一种语音信号处理方法、装置、设备及可读存储介质

Country Status (4)

Country Link
US (1) US11869528B2 (zh)
EP (1) EP3929919A4 (zh)
CN (1) CN109767780B (zh)
WO (1) WO2020181766A1 (zh)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109767780B (zh) * 2019-03-14 2021-01-01 苏州科达科技股份有限公司 一种语音信号处理方法、装置、设备及可读存储介质
CN111785289B (zh) * 2019-07-31 2023-12-05 北京京东尚科信息技术有限公司 残留回声消除方法和装置
CN110660408B (zh) * 2019-09-11 2022-02-22 厦门亿联网络技术股份有限公司 一种数字自动控制增益的方法和装置
CN111556210B (zh) * 2020-04-23 2021-10-22 深圳市未艾智能有限公司 通话语音处理方法与装置、终端设备和存储介质
TWI801978B (zh) * 2020-08-27 2023-05-11 仁寶電腦工業股份有限公司 線上會議系統以及近端主機
CN112397082B (zh) * 2020-11-17 2024-05-14 北京达佳互联信息技术有限公司 估计回声延迟的方法、装置、电子设备和存储介质
CN115240696B (zh) * 2022-07-26 2023-10-03 北京集智数字科技有限公司 一种语音识别方法及可读存储介质
CN116705045B (zh) * 2023-08-09 2023-10-13 腾讯科技(深圳)有限公司 回声消除方法、装置、计算机设备和存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104754157A (zh) * 2013-12-26 2015-07-01 联芯科技有限公司 一种残留回声抑制方法和系统
CN105791611A (zh) * 2016-02-22 2016-07-20 腾讯科技(深圳)有限公司 回声消除方法及装置
CN106657507A (zh) * 2015-11-03 2017-05-10 中移(杭州)信息技术有限公司 一种声学回声消除方法及装置
CN107123430A (zh) * 2017-04-12 2017-09-01 广州视源电子科技股份有限公司 回声消除方法、装置、会议平板及计算机存储介质
US9916840B1 (en) * 2016-12-06 2018-03-13 Amazon Technologies, Inc. Delay estimation for acoustic echo cancellation
CN109767780A (zh) * 2019-03-14 2019-05-17 苏州科达科技股份有限公司 一种语音信号处理方法、装置、设备及可读存储介质

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001044896A (ja) * 1999-08-03 2001-02-16 Matsushita Electric Ind Co Ltd 通話装置および通話方法
CN101409744B (zh) * 2008-11-27 2011-11-16 华为终端有限公司 移动终端及其提高通话质量的方法
US8472616B1 (en) 2009-04-02 2013-06-25 Audience, Inc. Self calibration of envelope-based acoustic echo cancellation
US8625776B2 (en) * 2009-09-23 2014-01-07 Polycom, Inc. Detection and suppression of returned audio at near-end
US9363600B2 (en) 2014-05-28 2016-06-07 Apple Inc. Method and apparatus for improved residual echo suppression and flexible tradeoffs in near-end distortion and echo reduction

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104754157A (zh) * 2013-12-26 2015-07-01 联芯科技有限公司 一种残留回声抑制方法和系统
CN106657507A (zh) * 2015-11-03 2017-05-10 中移(杭州)信息技术有限公司 一种声学回声消除方法及装置
CN105791611A (zh) * 2016-02-22 2016-07-20 腾讯科技(深圳)有限公司 回声消除方法及装置
US9916840B1 (en) * 2016-12-06 2018-03-13 Amazon Technologies, Inc. Delay estimation for acoustic echo cancellation
CN107123430A (zh) * 2017-04-12 2017-09-01 广州视源电子科技股份有限公司 回声消除方法、装置、会议平板及计算机存储介质
CN109767780A (zh) * 2019-03-14 2019-05-17 苏州科达科技股份有限公司 一种语音信号处理方法、装置、设备及可读存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3929919A4 *

Also Published As

Publication number Publication date
EP3929919A4 (en) 2022-05-04
CN109767780B (zh) 2021-01-01
CN109767780A (zh) 2019-05-17
EP3929919A1 (en) 2021-12-29
US20220223166A1 (en) 2022-07-14
US11869528B2 (en) 2024-01-09

Similar Documents

Publication Publication Date Title
WO2020181766A1 (zh) 一种语音信号处理方法、装置、设备及可读存储介质
US11297178B2 (en) Method, apparatus, and computer-readable media utilizing residual echo estimate information to derive secondary echo reduction parameters
JP4257113B2 (ja) 音響エコーの相殺および抑制を実行する利得制御方法
CN108141502B (zh) 降低声学系统中的声学反馈的方法及音频信号处理设备
CN108376548B (zh) 一种基于麦克风阵列的回声消除方法与系统
US9319783B1 (en) Attenuation of output audio based on residual echo
JP2009503568A (ja) 雑音環境における音声信号の着実な分離
CN106448691B (zh) 一种用于扩音通信系统的语音增强方法
CN109273019B (zh) 用于回声抑制的双重通话检测的方法及回声抑制
US20120183133A1 (en) Device and method for controlling damping of residual echo
WO2021114779A1 (zh) 基于双端发声检测的回声消除方法、装置及系统
US11538486B2 (en) Echo estimation and management with adaptation of sparse prediction filter set
WO2019068115A1 (en) ECHO CANCELLATION DEVICE AND METHOD THEREOF
CN106571147A (zh) 用于网络话机声学回声抑制的方法
CN111755020A (zh) 一种立体声回声消除方法
CN111355855B (zh) 回声处理方法、装置、设备及存储介质
Pawig et al. Adaptive sampling rate correction for acoustic echo control in voice-over-IP
Yang Multilayer adaptation based complex echo cancellation and voice enhancement
CN112929506B (zh) 音频信号的处理方法及装置,计算机存储介质及电子设备
US20220310106A1 (en) Echo canceller with variable step-size control
US10827076B1 (en) Echo path change monitoring in an acoustic echo canceler
KR20220157475A (ko) 반향 잔류 억제
WO2020191512A1 (zh) 回声消除装置、回声消除方法、信号处理芯片及电子设备
CN111294474B (zh) 一种双端通话检测方法
CN111050005B (zh) 一种偏差补偿的集员仿射投影的回声消除方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19918901

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2019918901

Country of ref document: EP

Effective date: 20210924