CN111292761A - Voice enhancement method and device - Google Patents

Voice enhancement method and device

Info

Publication number: CN111292761A
Authority: CN (China)
Prior art keywords: frame, audio signal, current, signal, ratio
Legal status: Granted; Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Application number: CN201910388459.6A
Other languages: Chinese (zh)
Other versions: CN111292761B (en)
Inventors: 纪伟, 于伟维, 潘思伟, 雍雅琴, 董斐, 孟建华, 林福辉
Current assignee: Spreadtrum Communications Tianjin Co Ltd
Original assignee: Spreadtrum Communications Tianjin Co Ltd
Application filed by Spreadtrum Communications Tianjin Co Ltd
Priority to CN201910388459.6A
Publication of CN111292761A; application granted; publication of CN111292761B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L21/0264: Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L21/034: Automatic adjustment (details of processing for speech enhancement by changing the amplitude)
    • G10L25/18: Speech or voice analysis techniques characterised by the extracted parameters being spectral information of each sub-band
    • G10L25/21: Speech or voice analysis techniques characterised by the extracted parameters being power information
    • G10L25/60: Speech or voice analysis techniques specially adapted for measuring the quality of voice signals
    • G10L25/84: Detection of presence or absence of voice signals for discriminating voice from noise
    • Y02D30/70: Reducing energy consumption in wireless communication networks

Abstract

The present disclosure relates to a speech enhancement method and apparatus. The method comprises: calculating the speech existence probability of the current frame audio signal, where the speech existence probability represents the probability that a speech signal is present in the current frame; obtaining the noise variance of the next frame audio signal by using the speech existence probability; and performing speech enhancement on the next frame audio signal by using that noise variance. Enhancing the next frame based on the speech existence probability can effectively improve the noise suppression level and reduce loss of the speech signal, thereby preserving the integrity and intelligibility of the speech signal after noise suppression and improving voice call quality.

Description

Voice enhancement method and device
Technical Field
The present disclosure relates to the field of speech processing technologies, and in particular, to a speech enhancement method and apparatus.
Background
When using mobile devices such as mobile phones for voice communication in daily life, the speaker is often in a background environment full of various noises. The speech signal collected by the microphone is therefore contaminated by noise; that is, the uplink signal is a speech signal containing noise. If the uplink signal is not processed, the far-end receiver will have difficulty hearing clear speech and may even fail to understand its meaning. Therefore, noise suppression must be performed on the near-end noisy speech, and the clean speech obtained after speech enhancement used as the uplink signal, so as to improve call quality.
Disclosure of Invention
In view of the above, according to one aspect of the present disclosure, the present disclosure proposes a speech enhancement method, the method including:
calculating the voice existence probability of the current frame audio signal, wherein the voice existence probability represents the existence probability of the voice signal in the current frame audio signal;
obtaining the noise variance of the next frame of audio signal by using the speech existence probability;
and performing voice enhancement on the next frame of audio signal by using the noise variance of the next frame of audio signal.
In a possible implementation, the calculating the speech existence probability of the current frame audio signal includes:
calculating the voice nonexistence probability of the current frame audio signal, wherein the voice nonexistence probability is the nonexistence probability of the voice signal in the current frame audio signal;
calculating the prior signal-to-noise ratio of the current frame audio signal;
and calculating the voice existence probability of the current frame audio signal by utilizing the voice nonexistence probability of the current frame audio signal and the prior signal-to-noise ratio.
In a possible implementation, the calculating the speech non-existence probability of the current frame audio signal includes:
calculating the minimum value of the power spectrum of the current frame audio signal according to the power spectrum of the current frame audio signal;
and calculating the voice non-existence probability of the current frame audio signal by using the power spectrum minimum value.
In one possible implementation, the calculating the minimum value of the power spectrum of the current frame audio signal according to the power spectrum of the current frame audio signal includes:
acquiring the minimum value of the power spectrum of the current frame audio signal by using the following formula:
Smin(k,λ) = α1·Smin(k,λ-1) + α2·(S(k,λ) - β·S(k,λ-1)), if Smin(k,λ-1) < S(k,λ); otherwise Smin(k,λ) = S(k,λ),
wherein Smin(k,λ) represents the minimum value of the power spectrum of the kth subband of the current λ frame, Smin(k,λ-1) represents the power spectrum minimum of the kth subband of the λ-1 frame, S(k,λ) represents the power spectrum of the kth subband of the current λ frame, S(k,λ-1) represents the power spectrum of the kth subband of the λ-1 frame, and α1, α2 and β are preset parameters.
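As a minimal sketch, the minimum tracking above can be implemented per sub-band as follows. This follows the standard continuous-minimum-tracking form that matches the symbols defined here; the parameter values are illustrative, not taken from the patent.

```python
def update_spectral_minimum(s_min_prev, s_cur, s_prev,
                            alpha1=0.998, alpha2=0.01, beta=0.8):
    """Track the running minimum Smin(k, lambda) of a sub-band power spectrum.

    While the previous minimum stays below the current power, decay the
    minimum toward the slope-compensated current power; otherwise reset
    it to the current power. alpha1, alpha2 and beta are the preset
    parameters of the formula (values here are illustrative).
    """
    if s_min_prev < s_cur:
        return alpha1 * s_min_prev + alpha2 * (s_cur - beta * s_prev)
    return s_cur
```

The update runs once per frame and per sub-band, so the minimum adapts continuously instead of requiring a long search window.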
In a possible implementation, the calculating the speech non-existence probability of the current frame audio signal by using the power spectrum minimum value includes:
obtaining a first ratio of the current frame audio signal by using the following formula:
γmin(k,λ) = Y²(k,λ) / (B·Smin(k,λ)),
wherein γmin(k,λ) represents the first ratio of the kth subband of the current λ frame, Y(k,λ) represents the amplitude spectrum of the kth subband of the current λ frame, Smin(k,λ) represents the power spectrum minimum value of the kth subband of the current λ frame, and B represents a bias compensation parameter;
obtaining a second ratio of the current frame audio signal according to the following formula:
η(k,λ) = S(k,λ) / (B·Smin(k,λ)),
wherein η(k,λ) represents the second ratio of the kth subband of the current λ frame, and S(k,λ) represents the power spectrum of the kth subband of the current λ frame;
and calculating the voice non-existence probability of the current frame signal according to the first ratio and the second ratio.
In a possible implementation manner, the calculating the speech non-existence probability of the current frame signal according to the first ratio and the second ratio includes:
determining the voice non-existence probability by the following formula under the condition that the first ratio is less than or equal to 1 and the second ratio is less than a first preset threshold:
q(k,λ) = 1, where q(k,λ) represents the probability of speech absence for the kth subband of the current λ frame; or
determining the voice non-existence probability by the following formula under the condition that the first ratio is greater than 1 and less than or equal to a second preset threshold and the second ratio is less than the first preset threshold:
q(k,λ) = (γ1 - γmin(k,λ)) / (γ1 - 1),
wherein γ1 represents the second preset threshold and γmin(k,λ) represents the first ratio; or
determining the voice non-existence probability by the following formula under the condition that the first ratio is greater than or equal to the second preset threshold or the second ratio is greater than or equal to the first preset threshold:
q(k,λ) = 0.
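The three-branch decision above can be sketched as a small function. The linear interpolation in the middle branch follows the standard soft-decision form consistent with the symbols defined here; the threshold values are illustrative placeholders, not the patent's preset values.

```python
def speech_absence_probability(gamma_min, eta, gamma1=3.0, eta_thresh=1.5):
    """Piecewise speech-absence probability q(k, lambda) from the two ratios.

    gamma1 plays the role of the second preset threshold and eta_thresh
    the first preset threshold (both values are illustrative).
    """
    if gamma_min <= 1.0 and eta < eta_thresh:
        return 1.0                                   # clearly noise only
    if 1.0 < gamma_min <= gamma1 and eta < eta_thresh:
        return (gamma1 - gamma_min) / (gamma1 - 1.0) # soft transition region
    return 0.0                                       # clearly speech present
```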
In a possible implementation, the calculating the prior signal-to-noise ratio of the current frame audio signal includes:
calculating the posterior signal-to-noise ratio of the current frame audio signal;
calculating the prior signal-to-noise ratio of the current frame audio signal by using the following formula:
ξ(k,λ) = ε·G(k,λ-1)·γ(k,λ-1) + (1-ε)·max{γ(k,λ)-1, 0}, where ξ(k,λ) represents the prior signal-to-noise ratio of the kth subband of the current λ frame, G(k,λ-1) represents the amplitude spectrum gain of the kth subband of the λ-1 frame, γ(k,λ-1) represents the posterior signal-to-noise ratio of the kth subband of the λ-1 frame, γ(k,λ) represents the posterior signal-to-noise ratio of the kth subband of the current λ frame, and ε is a constant.
In a possible implementation, the calculating the a posteriori signal-to-noise ratio of the current frame audio signal includes:
the posterior signal-to-noise ratio is calculated using the following formula:
γ(k,λ) = Y²(k,λ) / σd²(k,λ),
wherein Y(k,λ) represents the amplitude spectrum of the kth subband of the current λ frame, and σd²(k,λ) represents the noise variance of the kth subband of the current λ frame.
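A compact sketch of both SNR quantities, per sub-band, might look like this. Note that the classical decision-directed estimator squares the previous gain; the sketch follows the formula exactly as printed in this document, with an illustrative smoothing constant.

```python
def posterior_snr(y_mag, noise_var):
    """gamma(k, lambda) = |Y(k, lambda)|^2 / noise variance."""
    return (y_mag ** 2) / noise_var

def prior_snr(gain_prev, gamma_prev, gamma_cur, eps=0.98):
    """Decision-directed prior SNR, as printed here:
    xi = eps * G(k, lambda-1) * gamma(k, lambda-1)
         + (1 - eps) * max(gamma(k, lambda) - 1, 0).
    eps is a constant (0.98 is an illustrative choice)."""
    return eps * gain_prev * gamma_prev + (1.0 - eps) * max(gamma_cur - 1.0, 0.0)
```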
In a possible implementation manner, the calculating the speech existence probability of the current frame audio signal by using the speech absence probability and the prior signal-to-noise ratio of the current frame audio signal includes:
calculating the speech existence probability of the current frame audio signal by using the following formula:
p(k,λ) = 1 / (1 + [q(k,λ)/(1-q(k,λ))]·(1+ξ(k,λ))·exp(-v(k,λ))),
where p(k,λ) represents the probability of speech being present in the kth subband of the current λ frame, q(k,λ) represents the probability of speech being absent in the kth subband of the current λ frame, and ξ(k,λ) represents the prior signal-to-noise ratio of the kth subband of the current λ frame,
wherein
v(k,λ) = γ(k,λ)·ξ(k,λ) / (1+ξ(k,λ)),
and γ(k,λ) represents the posterior signal-to-noise ratio of the kth subband of the current λ frame.
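The speech existence probability can then be computed from q, the prior SNR and the posterior SNR; the sketch below assumes the standard Gaussian-model form consistent with the symbols above, with simple clamping at the degenerate endpoints q = 0 and q = 1.

```python
import math

def speech_presence_probability(q, xi, gamma):
    """p(k, lambda) from the speech-absence probability q, the prior SNR xi
    and the posterior SNR gamma."""
    if q >= 1.0:
        return 0.0           # speech certainly absent
    if q <= 0.0:
        return 1.0           # speech certainly present
    v = gamma * xi / (1.0 + xi)
    return 1.0 / (1.0 + (q / (1.0 - q)) * (1.0 + xi) * math.exp(-v))
```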
In a possible implementation, the obtaining the noise variance of the audio signal of the next frame by using the speech existence probability includes:
the noise variance of the next frame audio signal is calculated using the following formulas:
σd²(k,λ+1) = αD(k,λ)·σd²(k,λ) + (1-αD(k,λ))·Y²(k,λ);
αD(k,λ) = αd + (1-αd)·p(k,λ);
wherein σd²(k,λ+1) represents the noise variance of the kth subband of the λ+1 frame, σd²(k,λ) represents the noise variance of the kth subband of the current λ frame, Y(k,λ) represents the amplitude spectrum of the kth subband of the current λ frame, p(k,λ) represents the probability of speech presence in the kth subband of the current λ frame, and αd is a constant.
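The presence-weighted recursive update can be sketched directly; when speech is certainly present (p = 1) the effective smoothing factor becomes 1 and the noise estimate is frozen, and when speech is certainly absent (p = 0) it relaxes to plain exponential averaging. The value of the constant is illustrative.

```python
def update_noise_variance(noise_var, y_mag, p, alpha_d=0.95):
    """Next-frame noise variance sigma_d^2(k, lambda+1) from the current
    estimate, the current amplitude spectrum and the speech presence
    probability p(k, lambda). alpha_d is a constant (value illustrative)."""
    alpha_D = alpha_d + (1.0 - alpha_d) * p          # -> 1 as p -> 1
    return alpha_D * noise_var + (1.0 - alpha_D) * (y_mag ** 2)
```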
In a possible embodiment, the performing speech enhancement on the next frame audio signal by using the noise variance of the next frame audio signal includes:
obtaining the posterior signal-to-noise ratio of the next frame of audio signal by using the noise variance of the next frame of audio signal;
obtaining the prior signal-to-noise ratio of the next frame of audio signal by utilizing the posterior signal-to-noise ratio of the next frame of audio signal;
obtaining the amplitude spectrum gain of the next frame of audio signal by utilizing the posterior signal-to-noise ratio and the prior signal-to-noise ratio of the next frame of audio signal;
and performing voice enhancement on the next frame of audio signal by using the amplitude spectrum gain of the next frame of audio signal.
In a possible implementation manner, the obtaining the amplitude spectrum gain of the next frame audio signal by using the a posteriori signal-to-noise ratio and the a priori signal-to-noise ratio of the next frame audio signal includes:
obtaining the amplitude spectrum gain of the next frame audio signal by using the following formula:
G(k,λ+1) = [ξ(k,λ+1)/(1+ξ(k,λ+1))]·exp((1/2)·∫ from v(k,λ+1) to ∞ of (e^(-t)/t) dt),
where the λ+1 frame represents the next frame audio signal, G(k,λ+1) represents the amplitude spectrum gain of the kth subband of the λ+1 frame, and ξ(k,λ+1) represents the prior signal-to-noise ratio of the kth subband of the λ+1 frame,
wherein
v(k,λ+1) = γ(k,λ+1)·ξ(k,λ+1) / (1+ξ(k,λ+1)),
and γ(k,λ+1) represents the posterior signal-to-noise ratio of the kth subband of the λ+1 frame.
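A sketch of this gain computation follows, assuming the classical MMSE log-spectral-amplitude form, which matches the symbols used here. The exponential integral is approximated by simple midpoint-rule quadrature so the sketch stays dependency-free; in practice a library routine such as scipy.special.exp1 would be used instead.

```python
import math

def expint_e1(v, steps=20000, upper=50.0):
    """Numerically approximate E1(v) = integral from v to infinity of
    e^(-t)/t dt, for v > 0 (the tail beyond `upper` is negligible)."""
    total, dt = 0.0, (upper - v) / steps
    for i in range(steps):
        t = v + (i + 0.5) * dt            # midpoint rule
        total += math.exp(-t) / t * dt
    return total

def lsa_gain(xi, gamma):
    """Amplitude spectrum gain G = xi/(1+xi) * exp(0.5 * E1(v)),
    with v = gamma * xi / (1 + xi)."""
    v = gamma * xi / (1.0 + xi)
    return xi / (1.0 + xi) * math.exp(0.5 * expint_e1(v))
```

The enhanced amplitude spectrum of the next frame is then G(k,λ+1)·Y(k,λ+1) per sub-band, followed by the inverse transform back to the time domain.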
According to another aspect of the present disclosure, a speech enhancement apparatus is proposed, the apparatus comprising:
the calculating module is used for calculating the voice existence probability of the current frame audio signal, and the voice existence probability represents the existence probability of the voice signal in the current frame audio signal;
the obtaining module is connected with the calculating module and used for obtaining the noise variance of the audio signal of the next frame by utilizing the voice existence probability;
and the voice enhancement module is connected with the obtaining module and used for carrying out voice enhancement on the next frame of audio signal by utilizing the noise variance of the next frame of audio signal.
In one possible implementation, the calculation module includes:
the first calculation submodule is used for calculating the voice nonexistence probability of the current frame audio signal, and the voice nonexistence probability is the nonexistence probability of the voice signal in the current frame audio signal;
the second calculation submodule is used for calculating the prior signal-to-noise ratio of the current frame audio signal;
and the third calculation submodule is used for calculating the voice existence probability of the current frame audio signal by utilizing the voice nonexistence probability of the current frame audio signal and the prior signal-to-noise ratio.
In a possible implementation, the calculating the speech non-existence probability of the current frame audio signal includes:
calculating the minimum value of the power spectrum of the current frame audio signal according to the power spectrum of the current frame audio signal;
and calculating the voice non-existence probability of the current frame audio signal by using the power spectrum minimum value.
In one possible implementation, the calculating the minimum value of the power spectrum of the current frame audio signal according to the power spectrum of the current frame audio signal includes:
acquiring the minimum value of the power spectrum of the current frame audio signal by using the following formula:
Smin(k,λ) = α1·Smin(k,λ-1) + α2·(S(k,λ) - β·S(k,λ-1)), if Smin(k,λ-1) < S(k,λ); otherwise Smin(k,λ) = S(k,λ),
wherein Smin(k,λ) represents the minimum value of the power spectrum of the kth subband of the current λ frame, Smin(k,λ-1) represents the power spectrum minimum of the kth subband of the λ-1 frame, S(k,λ) represents the power spectrum of the kth subband of the current λ frame, S(k,λ-1) represents the power spectrum of the kth subband of the λ-1 frame, and α1, α2 and β are preset parameters.
In a possible implementation, the calculating the speech non-existence probability of the current frame audio signal by using the power spectrum minimum value includes:
obtaining a first ratio of the current frame audio signal by using the following formula:
γmin(k,λ) = Y²(k,λ) / (B·Smin(k,λ)),
wherein γmin(k,λ) represents the first ratio of the kth subband of the current λ frame, Y(k,λ) represents the amplitude spectrum of the kth subband of the current λ frame, Smin(k,λ) represents the power spectrum minimum value of the kth subband of the current λ frame, and B represents a bias compensation parameter;
obtaining a second ratio of the current frame audio signal according to the following formula:
η(k,λ) = S(k,λ) / (B·Smin(k,λ)),
wherein η(k,λ) represents the second ratio of the kth subband of the current λ frame, and S(k,λ) represents the power spectrum of the kth subband of the current λ frame;
and calculating the voice non-existence probability of the current frame signal according to the first ratio and the second ratio.
In a possible implementation manner, the calculating the speech non-existence probability of the current frame signal according to the first ratio and the second ratio includes:
determining the voice non-existence probability by the following formula under the condition that the first ratio is less than or equal to 1 and the second ratio is less than a first preset threshold:
q(k,λ) = 1, where q(k,λ) represents the probability of speech absence for the kth subband of the current λ frame; or
determining the voice non-existence probability by the following formula under the condition that the first ratio is greater than 1 and less than or equal to a second preset threshold and the second ratio is less than the first preset threshold:
q(k,λ) = (γ1 - γmin(k,λ)) / (γ1 - 1),
wherein γ1 represents the second preset threshold and γmin(k,λ) represents the first ratio; or
determining the voice non-existence probability by the following formula under the condition that the first ratio is greater than or equal to the second preset threshold or the second ratio is greater than or equal to the first preset threshold:
q(k,λ) = 0.
In a possible implementation, the calculating the prior signal-to-noise ratio of the current frame audio signal includes:
calculating the posterior signal-to-noise ratio of the current frame audio signal;
calculating the prior signal-to-noise ratio of the current frame audio signal by using the following formula:
ξ(k,λ) = ε·G(k,λ-1)·γ(k,λ-1) + (1-ε)·max{γ(k,λ)-1, 0}, where ξ(k,λ) represents the prior signal-to-noise ratio of the kth subband of the current λ frame, G(k,λ-1) represents the amplitude spectrum gain of the kth subband of the λ-1 frame, γ(k,λ-1) represents the posterior signal-to-noise ratio of the kth subband of the λ-1 frame, γ(k,λ) represents the posterior signal-to-noise ratio of the kth subband of the current λ frame, and ε is a constant.
In a possible implementation, the calculating the a posteriori signal-to-noise ratio of the current frame audio signal includes:
the posterior signal-to-noise ratio is calculated using the following formula:
γ(k,λ) = Y²(k,λ) / σd²(k,λ),
wherein Y(k,λ) represents the amplitude spectrum of the kth subband of the current λ frame, and σd²(k,λ) represents the noise variance of the kth subband of the current λ frame.
In a possible implementation manner, the calculating the speech existence probability of the current frame audio signal by using the speech absence probability and the prior signal-to-noise ratio of the current frame audio signal includes:
calculating the speech existence probability of the current frame audio signal by using the following formula:
p(k,λ) = 1 / (1 + [q(k,λ)/(1-q(k,λ))]·(1+ξ(k,λ))·exp(-v(k,λ))),
where p(k,λ) represents the probability of speech being present in the kth subband of the current λ frame, q(k,λ) represents the probability of speech being absent in the kth subband of the current λ frame, and ξ(k,λ) represents the prior signal-to-noise ratio of the kth subband of the current λ frame,
wherein
v(k,λ) = γ(k,λ)·ξ(k,λ) / (1+ξ(k,λ)),
and γ(k,λ) represents the posterior signal-to-noise ratio of the kth subband of the current λ frame.
In a possible implementation, the obtaining the noise variance of the audio signal of the next frame by using the speech existence probability includes:
the noise variance of the next frame audio signal is calculated using the following formulas:
σd²(k,λ+1) = αD(k,λ)·σd²(k,λ) + (1-αD(k,λ))·Y²(k,λ);
αD(k,λ) = αd + (1-αd)·p(k,λ);
wherein σd²(k,λ+1) represents the noise variance of the kth subband of the λ+1 frame, σd²(k,λ) represents the noise variance of the kth subband of the current λ frame, Y(k,λ) represents the amplitude spectrum of the kth subband of the current λ frame, p(k,λ) represents the probability of speech presence in the kth subband of the current λ frame, and αd is a constant.
In one possible embodiment, the speech enhancement module includes:
the posterior signal-to-noise ratio obtaining submodule is used for obtaining the posterior signal-to-noise ratio of the next frame of audio signal by utilizing the noise variance of the next frame of audio signal;
the prior signal-to-noise ratio obtaining submodule is used for obtaining the prior signal-to-noise ratio of the next frame of audio signal by utilizing the posterior signal-to-noise ratio of the next frame of audio signal;
the amplitude spectrum gain obtaining submodule is used for obtaining the amplitude spectrum gain of the next frame of audio signal by utilizing the posterior signal-to-noise ratio and the prior signal-to-noise ratio of the next frame of audio signal;
and the voice enhancement submodule is used for carrying out voice enhancement on the next frame of audio signal by utilizing the amplitude spectrum gain of the next frame of audio signal.
In a possible implementation manner, the obtaining the amplitude spectrum gain of the next frame audio signal by using the a posteriori signal-to-noise ratio and the a priori signal-to-noise ratio of the next frame audio signal includes:
obtaining the amplitude spectrum gain of the next frame audio signal by using the following formula:
G(k,λ+1) = [ξ(k,λ+1)/(1+ξ(k,λ+1))]·exp((1/2)·∫ from v(k,λ+1) to ∞ of (e^(-t)/t) dt),
where the λ+1 frame represents the next frame audio signal, G(k,λ+1) represents the amplitude spectrum gain of the kth subband of the λ+1 frame, and ξ(k,λ+1) represents the prior signal-to-noise ratio of the kth subband of the λ+1 frame,
wherein
v(k,λ+1) = γ(k,λ+1)·ξ(k,λ+1) / (1+ξ(k,λ+1)),
and γ(k,λ+1) represents the posterior signal-to-noise ratio of the kth subband of the λ+1 frame.
According to another aspect of the present disclosure, there is provided a speech enhancement apparatus including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to perform the above method.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the above-described method.
In this way, the method calculates the voice existence probability of the current frame audio signal, uses it to obtain the noise variance of the next frame audio signal, and uses that noise variance to perform voice enhancement on the next frame. The method suits non-stationary environmental noise at low signal-to-noise ratios: it can quickly track changes in noise intensity and update the noise variance in time. Enhancing the next frame based on the voice existence probability effectively improves the noise suppression level and reduces loss of the voice signal, thereby preserving the integrity and intelligibility of the voice signal after noise suppression and improving voice call quality.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.
FIG. 1 shows a flow diagram of a speech enhancement method according to an embodiment of the present disclosure.
Fig. 2 shows a schematic diagram of step S110 in a speech enhancement method according to an embodiment of the present disclosure.
Fig. 3 shows a schematic diagram of step S130 in a speech enhancement method according to an embodiment of the present disclosure.
Fig. 4a shows the effect of applying a speech enhancement method according to an embodiment of the present disclosure, and fig. 4b shows the effect of a related-art method.
FIG. 5 shows a block diagram of a speech enhancement apparatus according to an embodiment of the present disclosure.
FIG. 6 shows a block diagram of a speech enhancement apparatus according to an embodiment of the present disclosure.
FIG. 7 shows a block diagram of a speech enhancement apparatus according to an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
Related-art speech enhancement techniques typically use voice activity detection to determine whether speech is present in a noisy speech signal, i.e., to distinguish speech segments from non-speech (noise) segments in a stretch of noisy speech. Voice activity detection must constantly track changes in the environmental noise, and the accuracy of its decisions directly affects the speech enhancement result. However, when the environmental noise is non-stationary, its intensity varies irregularly over time and can be high, so the input signal-to-noise ratio is low; voice activity detection then cannot effectively distinguish noise from speech, the noise suppression effect is poor, and clean original speech cannot be obtained after enhancement.
When the environmental noise level changes suddenly, related-art speech enhancement algorithms exhibit a large tracking delay in estimating the noise parameters. Especially for non-stationary noise whose intensity changes irregularly, this delay prevents the speech signal from being restored completely and greatly degrades call quality.
Accordingly, the present disclosure is directed to a speech enhancement method for overcoming the problems occurring in the related art.
Turning to fig. 1, fig. 1 shows a flow chart of a speech enhancement method according to an embodiment of the present disclosure.
The method can be applied to a terminal, also called User Equipment (UE), a Mobile Station (MS), a Mobile Terminal (MT), and the like, which is a device that provides voice and/or data connectivity to a user, for example, a handheld device or a vehicle-mounted device having a wireless connection function. Currently, some examples of terminals are: a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a Mobile Internet Device (MID), a wearable device, a Virtual Reality (VR) device, an Augmented Reality (AR) device, a wireless terminal in industrial control, a wireless terminal in unmanned driving (self-driving), a wireless terminal in remote surgery (remote medical), a wireless terminal in a smart grid, a wireless terminal in transportation safety, a wireless terminal in a smart city, a wireless terminal in a smart home, a wireless terminal in vehicle networking, and the like.
As shown in fig. 1, the method includes:
step S110, calculating the voice existence probability of the current frame audio signal, wherein the voice existence probability represents the existence probability of the voice signal in the current frame audio signal;
step S120, obtaining the noise variance of the next frame of audio signal by using the voice existence probability;
step S130, performing speech enhancement on the next frame of audio signal by using the noise variance of the next frame of audio signal.
Through the above steps, the method calculates the voice existence probability of the current frame audio signal, obtains the noise variance of the next frame audio signal using the voice existence probability, and performs voice enhancement on the next frame audio signal using that noise variance. The method is suitable for non-stationary environmental noise with a low signal-to-noise ratio: it can quickly track changes in noise intensity and update the noise variance in time, and voice enhancement of the next frame audio signal based on the voice existence probability can effectively improve the noise suppression level and reduce the loss of the voice signal, thereby ensuring the integrity and intelligibility of the voice signal after noise suppression and improving voice call quality.
In one possible implementation, the step S110 of calculating the speech existence probability of the current audio signal may be performed in a frequency domain.
The audio signal may be represented as y(n) = x(n) + d(n), where x(n) is the clean speech signal and d(n) is the noise signal; i.e., the audio signal is noisy.
For example, the noisy audio signal may be subjected to framing, windowing, and fast fourier transform processing to convert the audio signal from the time domain to the frequency domain.
In one possible implementation, the speech signal may be framed, windowed, and Fast Fourier Transform (FFT) processed using related-art techniques. For example, the framing process may divide the audio signal into frames of a required frame length, where the time length of each frame may be set as needed; this is not limited by the present disclosure. A commonly used window function, such as a Hamming window, may be selected for windowing; for example, the Hamming window may be multiplied by the framed noisy audio signal in the time domain to obtain a windowed signal. After performing a fast Fourier transform on the windowed signal, the noisy audio signal is converted from the time domain to the frequency domain. The disclosed embodiments may process audio signals in the frequency domain, thereby enabling speech enhancement.
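As a concrete illustration of the framing, windowing, and FFT steps just described, the following NumPy sketch converts a noisy time-domain signal into per-frame magnitude spectra. The frame length of 256 samples and hop of 128 are illustrative choices, not values mandated by the disclosure:

```python
import numpy as np

def frame_fft(y, frame_len=256, hop=128):
    """Split a noisy time-domain signal into overlapping frames,
    apply a Hamming window, and return per-frame FFT magnitude spectra."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(y) - frame_len) // hop
    spectra = np.empty((n_frames, frame_len // 2 + 1))
    for i in range(n_frames):
        frame = y[i * hop : i * hop + frame_len] * window
        spectra[i] = np.abs(np.fft.rfft(frame))  # amplitude spectrum Y(k, lambda)
    return spectra

# Example: one second of white noise at 8 kHz -> per-frame magnitude spectra
rng = np.random.default_rng(0)
Y = frame_fft(rng.standard_normal(8000))
print(Y.shape)  # (number of frames, number of frequency bins)
```

Each row of the result corresponds to one frame λ, and each column to one frequency bin k, matching the Y(k, λ) notation used in the formulas below.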
The frequency domain signal of one frame of audio signal may be divided into sub-bands and speech enhancement may be performed for each sub-band, thereby implementing speech enhancement for one frame of audio signal.
Referring to fig. 2, fig. 2 is a schematic diagram illustrating step S110 in a speech enhancement method according to an embodiment of the present disclosure.
As shown in fig. 2, the step S110 of calculating the speech existence probability of the current frame audio signal may include:
step S1101, calculating a speech absence probability of a current frame audio signal, where the speech absence probability is an absence probability of a speech signal in the current frame audio signal;
step S1102, calculating the prior signal-to-noise ratio of the current frame audio signal;
step S1103, calculating the speech existence probability of the current frame audio signal by using the speech nonexistence probability of the current frame audio signal and the prior signal-to-noise ratio.
The present disclosure may perform soft decision on an input audio signal using a speech existence probability of a current frame audio signal to distinguish a noise signal from a speech signal in the audio signal.
For example, H0 and H1 may be assumed to respectively represent the absence and presence of a speech signal in the k-th sub-band of the λ-th frame. The speech presence probability may then be defined as p(k,λ) = P(H1(k,λ) | γ(k,λ)), where γ(k,λ) represents the a posteriori signal-to-noise ratio of the k-th sub-band of the current λ frame, p(k,λ) represents the speech presence probability of the k-th sub-band of the λ frame, and H1(k,λ) denotes that a speech signal exists in the k-th sub-band of the λ frame.
In one possible implementation, the step S1101 of calculating the speech non-existence probability of the current frame audio signal may include:
calculating the minimum value of the power spectrum of the current frame audio signal according to the power spectrum of the current frame audio signal;
and calculating the voice non-existence probability of the current frame audio signal by using the power spectrum minimum value.
In one possible implementation, the calculating the minimum value of the power spectrum of the current frame audio signal according to the power spectrum of the current frame audio signal may include:
acquiring the minimum value of the power spectrum of the current frame audio signal by using the following formula:
S(k,λ) = α1·S(k,λ−1) + (1 − α1)·|Y(k,λ)|²

S_min(k,λ) = α2·S_min(k,λ−1) + ((1 − α2)/(1 − β))·(S(k,λ) − β·S(k,λ−1)), if S_min(k,λ−1) < S(k,λ); otherwise S_min(k,λ) = S(k,λ)
wherein S_min(k,λ) represents the minimum value of the power spectrum of the k-th sub-band of the current λ frame, S_min(k,λ−1) represents the power spectrum minimum of the k-th sub-band of the λ−1 frame, S(k,λ) represents the power spectrum of the k-th sub-band of the current λ frame, S(k,λ−1) represents the power spectrum of the k-th sub-band of the λ−1 frame, and α1, α2 and β are preset parameters.

The values of α1, α2 and β may be taken in [0,1]; α1 and α2 act as smoothing parameters and β as a look-ahead parameter. By controlling the values of these three preset parameters, the present disclosure can control the minimum value of the power spectrum, and thereby control the intensity and update rate of the noise estimation.

In one possible embodiment, α1 may be 0.5, α2 may be 0.95, and β may be 0.5.
Through the method, the power spectrum of the audio signal containing the noise can be smoothed, and the minimum value of the power spectrum of the audio signal is obtained through the smoothed power spectrum.
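The smoothing-plus-minimum-tracking step can be sketched as follows. The patent's exact recursion is given only as an image; this sketch assumes the common Doblinger-style continuous-minimum form, using the roles the text assigns to the parameters (α1 smooths the power spectrum, α2 smooths the minimum, β acts as a look-ahead parameter):

```python
import numpy as np

ALPHA1, ALPHA2, BETA = 0.5, 0.95, 0.5  # example values from the text

def track_minimum(power_frames):
    """power_frames: (n_frames, n_bins) raw |Y|^2 values.
    Returns (smoothed power S, running power-spectrum minimum S_min)."""
    S = np.empty_like(power_frames)
    S_min = np.empty_like(power_frames)
    S[0] = power_frames[0]
    S_min[0] = power_frames[0]
    for l in range(1, len(power_frames)):
        # smooth the noisy power spectrum
        S[l] = ALPHA1 * S[l - 1] + (1 - ALPHA1) * power_frames[l]
        rising = S_min[l - 1] < S[l]
        tracked = (ALPHA2 * S_min[l - 1]
                   + (1 - ALPHA2) / (1 - BETA) * (S[l] - BETA * S[l - 1]))
        # while the spectrum rises, let the minimum drift up slowly;
        # otherwise snap to the new, lower level
        S_min[l] = np.where(rising, tracked, S[l])
    return S, S_min
```

With this recursion the minimum follows a noise-level drop immediately and a noise-level rise only gradually, which is what lets the later steps treat S_min as a noise-floor estimate.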
In a possible implementation, the calculating the speech non-existence probability of the current frame audio signal by using the power spectrum minimum value may include:
obtaining a first ratio of the current frame audio signal by using the following formula:
γ_min(k,λ) = |Y(k,λ)|² / (B·S_min(k,λ))

wherein γ_min(k,λ) represents the first ratio of the k-th sub-band of the current λ frame, Y(k,λ) represents the amplitude spectrum of the k-th sub-band of the current λ frame, S_min(k,λ) represents the power spectrum minimum of the k-th sub-band of the current λ frame, and B represents a bias compensation parameter;
obtaining a second ratio of the current frame audio signal according to the following formula:
η(k,λ) = S(k,λ) / (B·S_min(k,λ))

wherein η(k,λ) represents the second ratio of the k-th sub-band of the current λ frame and S(k,λ) represents the power spectrum of the k-th sub-band of the current λ frame;
and calculating the voice non-existence probability of the current frame signal according to the first ratio and the second ratio.
In one possible embodiment, the bias compensation parameter B may be obtained as the reciprocal of the expected value of the power spectrum minimum S_min(k,λ).
In one possible embodiment, B may be 1.66.
In a possible implementation manner, the calculating the speech non-existence probability of the current frame signal according to the first ratio and the second ratio may include:
when the first ratio is less than or equal to 1 and the second ratio is less than a first preset threshold (η)0) In the case of (γ)min(k, lambda) is less than or equal to 1 and η (k, lambda) is less than η0) Determining the speech absence probability by:
q (k, λ) ═ 1, where q (k, λ) represents the probability of speech absence for the kth subband of the current λ frame; or
When the first ratio is larger than 1 and smallerIs equal to or lower than a second preset threshold value, and the second ratio is smaller than the first preset threshold value (1 < gamma)min(k,λ)≤γ1And η (k, λ) < η0) Determining the speech absence probability by:
Figure BDA0002055631390000151
wherein, γ1Represents said second preset threshold value, γmin(k, λ) represents the first ratio; or
In the case where the first ratio is equal to or greater than the second preset threshold value, or the second ratio is equal to or greater than the first preset threshold value (γ)min(k,λ)≥γ1Or η (k, lambda) ≥ η0) Determining the speech absence probability by:
q(k,λ)=0。
The first preset threshold and the second preset threshold may be set as desired; in one possible embodiment, η0 may be 1.7 and γ1 may be 3.
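The three-branch soft decision above can be sketched as follows, using the example thresholds η0 = 1.7 and γ1 = 3. The linear interpolation used in the middle branch is an assumed IMCRA-style form, since the formula itself appears only as an image in the text:

```python
import numpy as np

ETA0, GAMMA1 = 1.7, 3.0  # example thresholds from the text

def speech_absence_prob(gamma_min, eta):
    """Three-branch soft decision for the speech absence probability q(k, lambda)."""
    gamma_min = np.asarray(gamma_min, dtype=float)
    eta = np.asarray(eta, dtype=float)
    q = np.zeros_like(gamma_min)  # default branch: clearly speech, q = 0
    # branch 1: clearly noise
    noise = (gamma_min <= 1.0) & (eta < ETA0)
    q[noise] = 1.0
    # branch 2: in-between region, linear in gamma_min (assumed interpolation)
    mid = (gamma_min > 1.0) & (gamma_min <= GAMMA1) & (eta < ETA0)
    q[mid] = (GAMMA1 - gamma_min[mid]) / (GAMMA1 - 1.0)
    return q
```

For example, a sub-band with γ_min = 2 and η = 1 lands in the middle branch and gets q = 0.5, halfway between the hard noise and speech decisions.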
In one possible implementation, the definition of the a priori signal-to-noise ratio may be:
ξ(k,λ) = σ_x²(k,λ) / σ_d²(k,λ)

That is, the a priori signal-to-noise ratio may refer to the ratio of the power of the clean speech signal to the noise power, where σ_x²(k,λ) and σ_d²(k,λ) are the variances of the clean speech signal and the noise signal, respectively.

Because the variance σ_x²(k,λ) of the clean speech signal is unknown, the a priori signal-to-noise ratio of the current frame audio signal needs to be calculated by other means; for example, it can be calculated by the following method:
in one possible implementation, the step S1102 of calculating the prior signal-to-noise ratio of the current frame audio signal may include:
calculating the posterior signal-to-noise ratio of the current frame audio signal;
calculating the prior signal-to-noise ratio of the current frame audio signal by using the following formula:
ξ(k,λ) = ε·G²(k,λ−1)·γ(k,λ−1) + (1 − ε)·max{γ(k,λ) − 1, 0}

where ξ(k,λ) represents the a priori signal-to-noise ratio of the k-th sub-band of the current λ frame, G(k,λ−1) represents the amplitude spectrum gain of the k-th sub-band of the λ−1 frame, γ(k,λ−1) represents the a posteriori signal-to-noise ratio of the k-th sub-band of the λ−1 frame, γ(k,λ) represents the a posteriori signal-to-noise ratio of the k-th sub-band of the current λ frame, and ε is a constant.
The amplitude spectrum gain of the k-th sub-band of the first frame may be determined as required; in one possible embodiment, the amplitude spectrum gain of the k-th sub-band of the first frame may be G(k,0) = 1.
The value range of epsilon can be determined as desired, and in a possible embodiment, the value range of epsilon can be [0,1 ].
In one possible embodiment, ε may be 0.9.
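A hedged sketch of the decision-directed a-priori-SNR update above, assuming the standard form with the squared previous-frame gain (superscripts are frequently lost in this text's extraction):

```python
import numpy as np

EPS = 0.9  # smoothing constant epsilon from the text

def prior_snr(gain_prev, gamma_prev, gamma_cur):
    """Decision-directed update:
    xi(l) = eps * G(l-1)^2 * gamma(l-1) + (1 - eps) * max(gamma(l) - 1, 0)."""
    return (EPS * gain_prev ** 2 * gamma_prev
            + (1 - EPS) * np.maximum(gamma_cur - 1.0, 0.0))

xi = prior_snr(np.array([2.0]), np.array([1.0]), np.array([1.0]))
print(xi)  # 0.9 * 2**2 * 1 + 0.1 * 0 = 3.6
```

The max{·, 0} term keeps the instantaneous estimate non-negative even when the posterior SNR drops below 1 in noise-only frames.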
In one possible embodiment, the a posteriori snr can be defined as: the ratio of the power of the noisy speech signal to the noise power. Therefore, the calculating the posterior signal-to-noise ratio of the current frame audio signal may include:
the a posteriori signal to noise ratio is calculated using the formula:
γ(k,λ) = |Y(k,λ)|² / σ_d²(k,λ)

wherein Y(k,λ) represents the amplitude spectrum of the k-th sub-band of the current λ frame and σ_d²(k,λ) represents the noise variance of the k-th sub-band of the current λ frame.
In a possible implementation manner, the step S1103 of calculating the speech existence probability of the current frame audio signal by using the speech absence probability of the current frame audio signal and the prior signal-to-noise ratio may include:
calculating the speech existence probability of the current frame audio signal by using the following formula:
p(k,λ) = { 1 + [q(k,λ) / (1 − q(k,λ))] · (1 + ξ(k,λ)) · exp(−v(k,λ)) }⁻¹

where p(k,λ) represents the probability of speech being present in the k-th sub-band of the current λ frame, q(k,λ) represents the probability of speech being absent in the k-th sub-band of the current λ frame, and ξ(k,λ) represents the a priori signal-to-noise ratio of the k-th sub-band of the current λ frame,

wherein

v(k,λ) = ξ(k,λ)·γ(k,λ) / (1 + ξ(k,λ))

and γ(k,λ) represents the a posteriori signal-to-noise ratio of the k-th sub-band of the current λ frame.
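The closed-form presence probability described above can be implemented directly; this sketch also makes the hard-decision endpoints (q = 1 and q = 0) explicit so the division by 1 − q never blows up:

```python
import numpy as np

def presence_prob(q, xi, gamma):
    """p = 1 / (1 + q/(1-q) * (1+xi) * exp(-v)),  v = xi*gamma/(1+xi)."""
    q = np.asarray(q, dtype=float)
    xi = np.asarray(xi, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    v = xi * gamma / (1.0 + xi)
    p = np.empty_like(q)
    p[q >= 1.0] = 0.0   # speech surely absent
    p[q <= 0.0] = 1.0   # speech surely present
    mid = (q > 0.0) & (q < 1.0)
    p[mid] = 1.0 / (1.0 + q[mid] / (1.0 - q[mid])
                    * (1.0 + xi[mid]) * np.exp(-v[mid]))
    return p
```

For instance, with q = 0.5, ξ = 1 and γ = 0 the soft decision gives p = 1/3, reflecting a sub-band that looks more like noise than speech.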
In a possible implementation, the step S120 of obtaining the noise variance of the audio signal of the next frame by using the speech existence probability may include:
the noise variance of the audio signal of the next frame is calculated using the following formula:
σ_d²(k,λ+1) = α_D(k,λ)·σ_d²(k,λ) + (1 − α_D(k,λ))·|Y(k,λ)|²

α_D(k,λ) = α_d + (1 − α_d)·p(k,λ)

wherein σ_d²(k,λ+1) represents the noise variance of the k-th sub-band of the λ+1 frame, σ_d²(k,λ) represents the noise variance of the k-th sub-band of the current λ frame, Y(k,λ) represents the amplitude spectrum of the k-th sub-band of the current λ frame, p(k,λ) represents the probability of speech presence in the k-th sub-band of the current λ frame, and α_d is a constant.
In one possible embodiment, the initial value of the noise variance at frame 0, σ_d²(k,0), may be the square of the amplitude spectrum of the 0-th frame noisy speech signal, i.e., |Y(k,0)|².
α_d can be determined as desired; in one possible embodiment, α_d may take a value in [0,1].

In one possible embodiment, α_d may be 0.8.
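Putting the pieces together, the probability-weighted noise-variance recursion can be sketched as follows (α_d = 0.8 as in the example above). When speech is surely present (p = 1) the noise estimate is frozen; when speech is surely absent (p = 0) it adapts fastest:

```python
import numpy as np

ALPHA_D = 0.8  # alpha_d example value from the text

def update_noise_var(noise_var, Y_mag, p):
    """sigma_d^2(k, l+1) = alpha_D * sigma_d^2(k, l) + (1 - alpha_D) * |Y|^2,
    with alpha_D(k, l) = alpha_d + (1 - alpha_d) * p(k, l)."""
    alpha_D = ALPHA_D + (1.0 - ALPHA_D) * p
    return alpha_D * noise_var + (1.0 - alpha_D) * Y_mag ** 2

nv = update_noise_var(np.array([1.0, 1.0]), np.array([3.0, 3.0]), np.array([0.0, 1.0]))
print(nv)  # [0.8*1 + 0.2*9, 1.0] = [2.6, 1.0]
```

This is the mechanism that lets the method track sudden noise-level changes: in noise-only sub-bands the estimate moves 20% of the way toward the new observation every frame.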
By the above method, the present disclosure can obtain the noise variance of the audio signal of the next frame using the speech existence probability.
Referring to fig. 3, fig. 3 is a schematic diagram illustrating step S130 in a speech enhancement method according to an embodiment of the present disclosure.
As shown in fig. 3, in a possible implementation, the step S130 of performing speech enhancement on the next frame audio signal by using the noise variance of the next frame audio signal may include:
step S1301, obtaining the posterior signal-to-noise ratio of the next frame of audio signal by using the noise variance of the next frame of audio signal;
step S1302, obtaining the prior signal-to-noise ratio of the next frame of audio signal by using the posterior signal-to-noise ratio of the next frame of audio signal;
step S1303, obtaining the amplitude spectrum gain of the next frame of audio signal by using the posterior signal-to-noise ratio and the prior signal-to-noise ratio of the next frame of audio signal;
step S1304, performing speech enhancement on the next frame of audio signal by using the amplitude spectrum gain of the next frame of audio signal.
In a possible implementation manner, the step S1303 obtaining the magnitude spectrum gain of the next frame of audio signal by using the a posteriori signal-to-noise ratio and the a priori signal-to-noise ratio of the next frame of audio signal may include:
obtaining the amplitude spectrum gain of the audio signal of the next frame by using the following formula:
G(k,λ+1) = [ξ(k,λ+1) / (1 + ξ(k,λ+1))] · exp( (1/2) ∫ from v(k,λ+1) to ∞ of (e⁻ᵗ/t) dt )

where the λ+1 frame represents the next frame audio signal, G(k,λ+1) represents the amplitude spectrum gain of the k-th sub-band of the λ+1 frame, and ξ(k,λ+1) represents the a priori signal-to-noise ratio of the k-th sub-band of the λ+1 frame,

wherein

v(k,λ+1) = ξ(k,λ+1)·γ(k,λ+1) / (1 + ξ(k,λ+1))

and γ(k,λ+1) represents the a posteriori signal-to-noise ratio of the k-th sub-band of the λ+1 frame.
Since the amplitude spectrum gain is obtained based on the voice existence probability, the value of the amplitude spectrum gain can be adaptively obtained according to the voice existence probability of the sub-band of the current frame, that is, the amplitude spectrum size is related to the voice existence probability.
It should be noted that the noise bandwidth and the voice bandwidth in the audio signal can be considered the same; that is, each sub-band of speech is interfered with by noise. From the frequency-domain perspective, the spectral amplitude value Y of the noisy speech (audio signal) in each sub-band is the sum of the speech spectral amplitude Yx and the noise spectral amplitude Yn at that frequency point. The speech enhancement algorithm calculates an amplitude spectrum gain G for each sub-band, so that the spectral amplitude of the speech can be recovered by multiplying the gain G by the spectral amplitude value Y.
For example, for a certain sub-band of a certain frame of the audio signal, when the algorithm determines that the sub-band is speech, the amplitude spectrum gain G is set to 1 or close to 1; when the algorithm determines that it is noise, G is set to 0 or close to 0; and if the algorithm determines that it is a mixture of noise and speech, G is set to a value between 0 and 1. In this way, the noise suppression level can be raised while the speech signal is preserved.
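The gain expression itself appears only as an image in the text; a common gain that is consistent with the prior/posterior-SNR machinery described here is the MMSE log-spectral-amplitude (LSA) gain, shown below as an assumed form. It exhibits exactly the behavior described above: G approaches 1 for speech-dominated sub-bands and 0 for noise-dominated ones:

```python
import numpy as np
from scipy.special import exp1  # exponential integral E1(x) = int_x^inf e^-t / t dt

def lsa_gain(xi, gamma):
    """Assumed MMSE log-spectral-amplitude gain:
    G = xi/(1+xi) * exp(0.5 * E1(v)),  v = xi*gamma/(1+xi)."""
    xi = np.asarray(xi, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    v = xi * gamma / (1.0 + xi)
    return xi / (1.0 + xi) * np.exp(0.5 * exp1(v))

g_hi = lsa_gain(np.array([100.0]), np.array([100.0]))  # strong speech -> near 1
g_lo = lsa_gain(np.array([0.01]), np.array([1.0]))     # noise-dominated -> near 0
```

The enhanced amplitude spectrum is then simply G(k,λ)·Y(k,λ) per sub-band.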
In a possible implementation, the step S1304 performing speech enhancement on the next frame of audio signal by using the amplitude spectrum gain of the next frame of audio signal may include:
and multiplying the obtained amplitude spectrum gain of the next frame of audio signal with the amplitude spectrum of the next frame of audio signal in the frequency domain to obtain an enhanced amplitude spectrum.
After obtaining the enhanced magnitude spectrum, an inverse fast Fourier transform (IFFT) may be performed to obtain a time-domain signal. The time-domain signal and the synthesis window are multiplied in the time domain to obtain a windowed signal, and an enhanced voice output signal is further obtained by the overlap-add method. Of course, the above description is exemplary; after obtaining the enhanced magnitude spectrum, one skilled in the art can select a specific method to obtain the speech output signal using the magnitude spectrum as needed.
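The synthesis stage described above can be sketched as follows; the synthesis window and hop are assumed to match the analysis stage used earlier:

```python
import numpy as np

def overlap_add(enhanced_spectra, synth_window, hop):
    """Inverse-FFT each enhanced spectrum, apply the synthesis window,
    and overlap-add the frames back into a time-domain signal."""
    frame_len = len(synth_window)
    n_frames = len(enhanced_spectra)
    out = np.zeros((n_frames - 1) * hop + frame_len)
    for i, spec in enumerate(enhanced_spectra):
        frame = np.fft.irfft(spec, n=frame_len) * synth_window
        out[i * hop : i * hop + frame_len] += frame  # overlap-add
    return out
```

With a rectangular window and hop equal to the frame length (no overlap), the frames are simply concatenated; in practice an overlapping hop with a matched analysis/synthesis window pair is used for smooth reconstruction.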
Referring to fig. 4a-4b together, fig. 4a is a schematic diagram illustrating an effect of applying a speech enhancement method according to an embodiment of the present disclosure, and fig. 4b is a schematic diagram illustrating an effect of using a related art.
In fig. 4a and 4b, the vertical axis represents frequency and the horizontal axis represents time.
When ε = 0.9, α1 = 0.5, α2 = 0.95, β = 0.5, B = 1.66, γ1 = 3, η0 = 1.7 and α_d = 0.8 are set, and the noise starts at time 0, it can be seen from fig. 4a that the speech enhancement method of the present disclosure can track the noise level change within 1 second, complete the noise parameter estimation, and implement noise suppression. In fig. 4b, the related art (e.g., the minimum control method) requires approximately 3 seconds to complete the noise tracking.
And, when the 15 th second noise level has a sudden change, as shown in fig. 4a, the speech enhancement method of the present disclosure can quickly complete the noise variance estimation, thereby tracking the noise level change and ensuring that the enhanced speech signal is not affected by the burst noise. However, as shown in fig. 4b, when the noise level abruptly changes at the 15 th second, it is obvious that the related art cannot quickly track the change of the noise level.
Through the method, the noise variance of the sudden steady noise or the non-steady noise which continuously changes along with time can be rapidly updated according to the noise level in the non-steady noise environment, and the noise suppression level can be remarkably improved. And the soft decision based on the voice existence probability can be effectively suitable for the condition that the low signal-to-noise ratio contains non-stationary noise voice signals, and the voice call quality can be obviously improved.
Referring to fig. 5, fig. 5 is a block diagram of a speech enhancement apparatus according to an embodiment of the present disclosure.
As shown in fig. 5, the apparatus includes:
a calculating module 10, configured to calculate a speech existence probability of a current frame audio signal, where the speech existence probability represents an existence probability of a speech signal in the current frame audio signal;
an obtaining module 20, connected to the calculating module 10, for obtaining the noise variance of the audio signal of the next frame by using the speech existence probability;
a speech enhancement module 30, connected to the obtaining module 20, for performing speech enhancement on the next frame of audio signal by using the noise variance of the next frame of audio signal.
With the above apparatus, the present disclosure calculates the speech existence probability of the current frame audio signal, obtains the noise variance of the next frame audio signal using the speech existence probability, and performs speech enhancement on the next frame audio signal using that noise variance. The apparatus is suitable for non-stationary environmental noise with a low signal-to-noise ratio: it can quickly track changes in noise intensity and update the noise variance in time, and speech enhancement of the next frame audio signal based on the speech existence probability can effectively improve the noise suppression level and reduce the loss of the voice signal, thereby ensuring the integrity and intelligibility of the voice signal after noise suppression and improving voice call quality.
In one possible implementation, the calculation module includes:
the first calculation submodule is used for calculating the voice nonexistence probability of the current frame audio signal, and the voice nonexistence probability is the nonexistence probability of the voice signal in the current frame audio signal;
the second calculation submodule is used for calculating the prior signal-to-noise ratio of the current frame audio signal;
and the third calculation submodule is used for calculating the voice existence probability of the current frame audio signal by utilizing the voice nonexistence probability of the current frame audio signal and the prior signal-to-noise ratio.
In a possible implementation, the calculating the speech non-existence probability of the current frame audio signal includes:
calculating the minimum value of the power spectrum of the current frame audio signal according to the power spectrum of the current frame audio signal;
and calculating the voice non-existence probability of the current frame audio signal by using the power spectrum minimum value.
In one possible implementation, the calculating the minimum value of the power spectrum of the current frame audio signal according to the power spectrum of the current frame audio signal includes:
acquiring the minimum value of the power spectrum of the current frame audio signal by using the following formula:
S(k,λ) = α1·S(k,λ−1) + (1 − α1)·|Y(k,λ)|²

S_min(k,λ) = α2·S_min(k,λ−1) + ((1 − α2)/(1 − β))·(S(k,λ) − β·S(k,λ−1)), if S_min(k,λ−1) < S(k,λ); otherwise S_min(k,λ) = S(k,λ)

wherein S_min(k,λ) represents the minimum value of the power spectrum of the k-th sub-band of the current λ frame, S_min(k,λ−1) represents the power spectrum minimum of the k-th sub-band of the λ−1 frame, S(k,λ) represents the power spectrum of the k-th sub-band of the current λ frame, S(k,λ−1) represents the power spectrum of the k-th sub-band of the λ−1 frame, and α1, α2 and β are preset parameters.
In a possible implementation, the calculating the speech non-existence probability of the current frame audio signal by using the power spectrum minimum value includes:
obtaining a first ratio of the current frame audio signal by using the following formula:
γ_min(k,λ) = |Y(k,λ)|² / (B·S_min(k,λ))

wherein γ_min(k,λ) represents the first ratio of the k-th sub-band of the current λ frame, Y(k,λ) represents the amplitude spectrum of the k-th sub-band of the current λ frame, S_min(k,λ) represents the power spectrum minimum of the k-th sub-band of the current λ frame, and B represents a bias compensation parameter;
obtaining a second ratio of the current frame audio signal according to the following formula:
η(k,λ) = S(k,λ) / (B·S_min(k,λ))

wherein η(k,λ) represents the second ratio of the k-th sub-band of the current λ frame and S(k,λ) represents the power spectrum of the k-th sub-band of the current λ frame;
and calculating the voice non-existence probability of the current frame signal according to the first ratio and the second ratio.
In a possible implementation manner, the calculating the speech non-existence probability of the current frame signal according to the first ratio and the second ratio includes:
determining the voice non-existence probability by the following formula under the condition that the first ratio is less than or equal to 1 and the second ratio is less than a first preset threshold value:
q (k, λ) ═ 1, where q (k, λ) represents the probability of speech absence for the kth subband of the current λ frame; or
Determining the voice non-existence probability through the following formula under the condition that the first ratio is greater than 1 and less than or equal to a second preset threshold and the second ratio is less than the first preset threshold:
q(k,λ) = (γ1 − γ_min(k,λ)) / (γ1 − 1)

wherein γ1 represents the second preset threshold and γ_min(k,λ) represents the first ratio; or
Determining the voice non-existence probability through the following formula under the condition that the first ratio is greater than or equal to the second preset threshold or the second ratio is greater than or equal to the first preset threshold:
q(k,λ)=0。
in a possible implementation, the calculating the prior signal-to-noise ratio of the current frame audio signal includes:
calculating the posterior signal-to-noise ratio of the current frame audio signal;
calculating the prior signal-to-noise ratio of the current frame audio signal by using the following formula:
ξ(k,λ) = ε·G²(k,λ−1)·γ(k,λ−1) + (1 − ε)·max{γ(k,λ) − 1, 0}

where ξ(k,λ) represents the a priori signal-to-noise ratio of the k-th sub-band of the current λ frame, G(k,λ−1) represents the amplitude spectrum gain of the k-th sub-band of the λ−1 frame, γ(k,λ−1) represents the a posteriori signal-to-noise ratio of the k-th sub-band of the λ−1 frame, γ(k,λ) represents the a posteriori signal-to-noise ratio of the k-th sub-band of the current λ frame, and ε is a constant.
In a possible implementation, the calculating the a posteriori signal-to-noise ratio of the current frame audio signal includes:
the a posteriori signal to noise ratio is calculated using the formula:
γ(k,λ) = |Y(k,λ)|² / σ_d²(k,λ)

wherein Y(k,λ) represents the amplitude spectrum of the k-th sub-band of the current λ frame and σ_d²(k,λ) represents the noise variance of the k-th sub-band of the current λ frame.
In a possible implementation manner, the calculating the speech existence probability of the current frame audio signal by using the speech absence probability and the prior signal-to-noise ratio of the current frame audio signal includes:
calculating the speech existence probability of the current frame audio signal by using the following formula:
p(k,λ) = { 1 + [q(k,λ) / (1 − q(k,λ))] · (1 + ξ(k,λ)) · exp(−v(k,λ)) }⁻¹

where p(k,λ) represents the probability of speech being present in the k-th sub-band of the current λ frame, q(k,λ) represents the probability of speech being absent in the k-th sub-band of the current λ frame, and ξ(k,λ) represents the a priori signal-to-noise ratio of the k-th sub-band of the current λ frame,

wherein

v(k,λ) = ξ(k,λ)·γ(k,λ) / (1 + ξ(k,λ))

and γ(k,λ) represents the a posteriori signal-to-noise ratio of the k-th sub-band of the current λ frame.
In a possible implementation, the obtaining the noise variance of the audio signal of the next frame by using the speech existence probability includes:
the noise variance of the audio signal of the next frame is calculated using the following formula:
σ_d²(k,λ+1) = α_D(k,λ)·σ_d²(k,λ) + (1 − α_D(k,λ))·|Y(k,λ)|²

α_D(k,λ) = α_d + (1 − α_d)·p(k,λ)

wherein σ_d²(k,λ+1) represents the noise variance of the k-th sub-band of the λ+1 frame, σ_d²(k,λ) represents the noise variance of the k-th sub-band of the current λ frame, Y(k,λ) represents the amplitude spectrum of the k-th sub-band of the current λ frame, p(k,λ) represents the probability of speech presence in the k-th sub-band of the current λ frame, and α_d is a constant.
In one possible embodiment, the speech enhancement module includes:
the posterior signal-to-noise ratio obtaining submodule is used for obtaining the posterior signal-to-noise ratio of the next frame of audio signal by utilizing the noise variance of the next frame of audio signal;
the prior signal-to-noise ratio obtaining submodule is used for obtaining the prior signal-to-noise ratio of the next frame of audio signal by utilizing the posterior signal-to-noise ratio of the next frame of audio signal;
the amplitude spectrum gain obtaining submodule is used for obtaining the amplitude spectrum gain of the next frame of audio signal by utilizing the posterior signal-to-noise ratio and the prior signal-to-noise ratio of the next frame of audio signal;
and the voice enhancement submodule is used for carrying out voice enhancement on the next frame of audio signal by utilizing the amplitude spectrum gain of the next frame of audio signal.
In a possible implementation manner, the obtaining the amplitude spectrum gain of the next frame audio signal by using the a posteriori signal-to-noise ratio and the a priori signal-to-noise ratio of the next frame audio signal includes:
obtaining the amplitude spectrum gain of the audio signal of the next frame by using the following formula:
G(k,λ+1) = [ξ(k,λ+1) / (1 + ξ(k,λ+1))] · exp( (1/2) ∫ from v(k,λ+1) to ∞ of (e⁻ᵗ/t) dt )

where the λ+1 frame represents the next frame audio signal, G(k,λ+1) represents the amplitude spectrum gain of the k-th sub-band of the λ+1 frame, and ξ(k,λ+1) represents the a priori signal-to-noise ratio of the k-th sub-band of the λ+1 frame,

wherein

v(k,λ+1) = ξ(k,λ+1)·γ(k,λ+1) / (1 + ξ(k,λ+1))

and γ(k,λ+1) represents the a posteriori signal-to-noise ratio of the k-th sub-band of the λ+1 frame.
It should be noted that the above speech enhancement apparatus is an apparatus corresponding to the aforementioned speech enhancement method, and for the specific description, reference is made to the previous description of the speech enhancement method, which is not repeated herein.
Referring to fig. 6, fig. 6 shows a block diagram of a speech enhancement device according to an embodiment of the present disclosure.
For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 6, the apparatus 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the apparatus 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power components 806 provide power to the various components of device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the device 800. For example, the sensor assembly 814 may detect the open/closed status of the device 800 and the relative positioning of components, such as a display and keypad of the device 800. The sensor assembly 814 may also detect a change in the position of the device 800 or a component of the device 800, the presence or absence of user contact with the device 800, the orientation or acceleration/deceleration of the device 800, and a change in the temperature of the device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communications between the apparatus 800 and other devices in a wired or wireless manner. The device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium, such as the memory 804, is also provided that includes computer program instructions executable by the processor 820 of the device 800 to perform the above-described methods.
Referring to fig. 7, fig. 7 is a block diagram of a speech enhancement device according to an embodiment of the present disclosure.
For example, the apparatus 1900 may be provided as a server. Referring to fig. 7, the device 1900 includes a processing component 1922 further including one or more processors and memory resources, represented by memory 1932, for storing instructions, e.g., applications, executable by the processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform the above-described method.
The device 1900 may also include a power component 1926 configured to perform power management of the device 1900, a wired or wireless network interface 1950 configured to connect the device 1900 to a network, and an input/output (I/O) interface 1958. The device 1900 may operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium, such as the memory 1932, is also provided that includes computer program instructions executable by the processing component 1922 of the apparatus 1900 to perform the above-described methods.
The present disclosure may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk, C++, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, the electronic circuitry that can execute the computer-readable program instructions implements aspects of the present disclosure by utilizing the state information of the computer-readable program instructions to personalize the electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA).
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen in order to best explain the principles of the embodiments, the practical application, or technical improvements to the techniques in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (26)

1. A method of speech enhancement, the method comprising:
calculating the voice existence probability of the current frame audio signal, wherein the voice existence probability represents the existence probability of the voice signal in the current frame audio signal;
obtaining the noise variance of the next frame of audio signal by using the speech existence probability;
and performing voice enhancement on the next frame of audio signal by using the noise variance of the next frame of audio signal.
2. The method of claim 1, wherein the calculating the speech existence probability of the current frame audio signal comprises:
calculating the voice nonexistence probability of the current frame audio signal, wherein the voice nonexistence probability is the nonexistence probability of the voice signal in the current frame audio signal;
calculating the prior signal-to-noise ratio of the current frame audio signal;
and calculating the voice existence probability of the current frame audio signal by utilizing the voice nonexistence probability of the current frame audio signal and the prior signal-to-noise ratio.
3. The method of claim 2, wherein the calculating the speech absence probability of the current frame audio signal comprises:
calculating the minimum value of the power spectrum of the current frame audio signal according to the power spectrum of the current frame audio signal;
and calculating the voice non-existence probability of the current frame audio signal by using the power spectrum minimum value.
4. The method according to claim 3, wherein said calculating the minimum value of the power spectrum of the current frame audio signal according to the power spectrum of the current frame audio signal comprises:
acquiring the minimum value of the power spectrum of the current frame audio signal by using the following formula:
Smin(k, λ) = α1·Smin(k, λ-1) + β·(S(k, λ) − α2·S(k, λ-1)) if Smin(k, λ-1) < S(k, λ), and Smin(k, λ) = S(k, λ) otherwise,
wherein Smin(k, λ) represents the minimum value of the power spectrum of the k-th subband of the current λ frame, Smin(k, λ-1) represents the power spectrum minimum of the k-th subband of the λ-1 frame, S(k, λ) represents the power spectrum of the k-th subband of the current λ frame, S(k, λ-1) represents the power spectrum of the k-th subband of the λ-1 frame, and α1, α2 and β are preset parameters.
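As an illustration, one update of this Doblinger-style continuous minimum tracking might look as follows. The parameter values, and the exact branch structure (the published formula is an image not reproduced in the text), are assumptions:

```python
def track_minimum(S, S_prev, Smin_prev, alpha1=0.998, alpha2=0.96, beta=0.8):
    """One continuous-minimum update for a single sub-band.
    S = S(k, λ), S_prev = S(k, λ-1), Smin_prev = Smin(k, λ-1);
    alpha1, alpha2 and beta stand in for α1, α2 and β (values illustrative)."""
    if Smin_prev < S:
        # spectrum is above the tracked minimum: let the minimum drift slowly
        return alpha1 * Smin_prev + beta * (S - alpha2 * S_prev)
    # spectrum dropped below the tracked minimum: follow it immediately
    return S

# track across three frames of one sub-band
Smin = 1.0
for S_prev, S in [(1.2, 1.1), (1.1, 0.7), (0.7, 0.9)]:
    Smin = track_minimum(S, S_prev, Smin)
```

The tracked value rides close under the power spectrum, so a later ratio of spectrum to minimum indicates how far the current frame is above the noise floor.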
5. The method according to claim 3 or 4, wherein the calculating the speech absence probability of the current frame audio signal by using the power spectrum minimum value comprises:
obtaining a first ratio of the current frame audio signal by using the following formula:
γmin(k, λ) = Y²(k, λ)/(B·Smin(k, λ)),
wherein, γmin(k, λ) represents a first ratio of the kth subband of the current λ frame, Y (k, λ) represents a magnitude spectrum of the kth subband of the current λ frame, Smin(k, λ) represents the power spectrum minimum value of the kth sub-band of the current λ frame, B represents a bias compensation parameter;
obtaining a second ratio of the current frame audio signal according to the following formula:
η(k, λ) = S(k, λ)/(B·Smin(k, λ)),
wherein η (k, λ) represents a second ratio of the kth subband of the current λ frame, and S (k, λ) represents a power spectrum of the kth subband of the current λ frame;
and calculating the voice non-existence probability of the current frame signal according to the first ratio and the second ratio.
6. The method of claim 5, wherein the calculating the speech non-existence probability of the current frame signal according to the first ratio and the second ratio comprises:
determining the voice non-existence probability by the following formula under the condition that the first ratio is less than or equal to 1 and the second ratio is less than a first preset threshold value:
q(k, λ) = 1, where q(k, λ) represents the probability of speech absence for the k-th subband of the current λ frame; or
Determining the voice non-existence probability through the following formula under the condition that the first ratio is greater than 1 and less than or equal to a second preset threshold and the second ratio is less than the first preset threshold:
q(k, λ) = (γ1 − γmin(k, λ))/(γ1 − 1),
wherein γ1 represents said second preset threshold value, and γmin(k, λ) represents the first ratio; or
Determining the voice non-existence probability through the following formula under the condition that the first ratio is greater than or equal to the second preset threshold or the second ratio is greater than or equal to the first preset threshold:
q(k, λ) = 0.
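The three-branch decision of claims 5 and 6 can be sketched as follows; the default threshold values passed to the function are illustrative, not from the patent:

```python
def speech_absence_prob(gamma_min, eta, gamma1=3.0, eta1=1.5):
    """Speech absence probability q(k, λ) from the first ratio γmin(k, λ)
    and the second ratio η(k, λ). gamma1 is the second preset threshold and
    eta1 the first preset threshold (both values illustrative)."""
    if gamma_min <= 1.0 and eta < eta1:
        return 1.0                                    # clearly noise only
    if 1.0 < gamma_min <= gamma1 and eta < eta1:
        return (gamma1 - gamma_min) / (gamma1 - 1.0)  # soft transition region
    return 0.0                                        # clearly speech present
```

q falls linearly from 1 to 0 as the first ratio grows across the transition region, giving a soft rather than binary voice-activity decision.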
7. the method of claim 2, wherein the calculating the prior signal-to-noise ratio of the current frame audio signal comprises:
calculating the posterior signal-to-noise ratio of the current frame audio signal;
calculating the prior signal-to-noise ratio of the current frame audio signal by using the following formula:
ξ(k, λ) = ε·G(k, λ-1)·γ(k, λ-1) + (1 − ε)·max{γ(k, λ) − 1, 0}, where ξ(k, λ) represents the a priori signal-to-noise ratio of the k-th subband of the current λ frame, G(k, λ-1) represents the amplitude spectral gain of the k-th subband of the λ-1 frame, γ(k, λ-1) represents the a posteriori signal-to-noise ratio of the k-th subband of the λ-1 frame, γ(k, λ) represents the a posteriori signal-to-noise ratio of the k-th subband of the current λ frame, and ε is a constant.
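This is a decision-directed estimate; a one-function sketch follows. The ε value is illustrative, and note that the classical Ephraim–Malah estimator uses the squared previous gain, while the claim as printed multiplies G(k, λ-1) directly, which is what is followed here:

```python
def prior_snr(G_prev, gamma_prev, gamma, eps=0.92):
    """Decision-directed a priori SNR ξ(k, λ): a weighted mix of the previous
    frame's gained a posteriori SNR and the current instantaneous SNR
    estimate, floored at zero. eps stands in for ε (value illustrative)."""
    return eps * G_prev * gamma_prev + (1.0 - eps) * max(gamma - 1.0, 0.0)
```

The max{·, 0} floor prevents a momentary dip of the a posteriori SNR below 1 from producing a negative a priori SNR.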
8. The method according to claim 7, wherein said calculating the a posteriori signal-to-noise ratio of the current frame audio signal comprises:
the a posteriori signal to noise ratio is calculated using the formula:
γ(k, λ) = Y²(k, λ)/σd²(k, λ),
wherein Y(k, λ) represents the amplitude spectrum of the k-th sub-band of the current λ frame, and σd²(k, λ) represents the noise variance of the k-th sub-band of the current λ frame.
9. The method of claim 2, wherein the calculating the speech existence probability of the current frame audio signal by using the speech absence probability of the current frame audio signal and the prior signal-to-noise ratio comprises:
calculating the speech existence probability of the current frame audio signal by using the following formula:
p(k, λ) = {1 + [q(k, λ)/(1 − q(k, λ))]·(1 + ξ(k, λ))·exp(−v(k, λ))}^(-1),
where p(k, λ) represents the speech presence probability of the k-th subband of the current λ frame, q(k, λ) represents the speech absence probability of the k-th subband of the current λ frame, ξ(k, λ) represents the a priori signal-to-noise ratio of the k-th subband of the current λ frame,
wherein v(k, λ) = γ(k, λ)·ξ(k, λ)/(1 + ξ(k, λ)),
and γ(k, λ) represents the a posteriori signal-to-noise ratio of the k-th subband of the current λ frame.
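Combining the absence probability with the two signal-to-noise ratios gives the presence probability; a sketch under the same notation (function name illustrative):

```python
import math

def speech_presence_prob(q, xi, gamma):
    """Speech presence probability p(k, λ) from the absence probability
    q(k, λ), the a priori SNR ξ(k, λ) and the a posteriori SNR γ(k, λ)."""
    if q >= 1.0:
        return 0.0                     # speech declared absent outright
    v = gamma * xi / (1.0 + xi)        # v(k, λ)
    return 1.0 / (1.0 + q / (1.0 - q) * (1.0 + xi) * math.exp(-v))
```

When q = 0 the probability is exactly 1; as q approaches 1 or the SNRs shrink, p drops toward 0.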
10. The method according to claim 1, wherein the obtaining the noise variance of the audio signal of the next frame by using the speech existence probability comprises:
the noise variance of the audio signal of the next frame is calculated using the following formula:
σd²(k, λ+1) = αD(k, λ)·σd²(k, λ) + (1 − αD(k, λ))·Y²(k, λ);
αD(k, λ) = αd + (1 − αd)·p(k, λ);
wherein σd²(k, λ+1) represents the noise variance of the k-th subband of the λ+1 frame, σd²(k, λ) represents the noise variance of the k-th subband of the current λ frame, Y(k, λ) represents the magnitude spectrum of the k-th subband of the current λ frame, p(k, λ) represents the probability of speech presence of the k-th subband of the current λ frame, and αd is a constant.
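This recursion keeps the noise estimate frozen where speech is likely present (p → 1) and tracks |Y|² where it is likely absent (p → 0); a sketch with an illustrative αd:

```python
def update_noise_variance(sigma2, Y, p, alpha_d=0.85):
    """One noise-variance update for a sub-band: the effective smoothing
    factor αD(k, λ) = αd + (1 - αd)·p(k, λ) approaches 1 when speech is
    likely, so the estimate is then left unchanged. alpha_d is illustrative."""
    alpha_D = alpha_d + (1.0 - alpha_d) * p
    return alpha_D * sigma2 + (1.0 - alpha_D) * Y ** 2
```

Because the smoothing weight is driven by the per-frame presence probability, no explicit voice-activity detector or noise-only training period is needed.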
11. The method according to claim 1, wherein said performing speech enhancement on the next frame audio signal by using the noise variance of the next frame audio signal comprises:
obtaining the posterior signal-to-noise ratio of the next frame of audio signal by using the noise variance of the next frame of audio signal;
obtaining the prior signal-to-noise ratio of the next frame of audio signal by utilizing the posterior signal-to-noise ratio of the next frame of audio signal;
obtaining the amplitude spectrum gain of the next frame of audio signal by utilizing the posterior signal-to-noise ratio and the prior signal-to-noise ratio of the next frame of audio signal;
and performing voice enhancement on the next frame of audio signal by using the amplitude spectrum gain of the next frame of audio signal.
12. The method according to claim 11, wherein said obtaining the amplitude spectrum gain of the next frame audio signal by using the a posteriori signal-to-noise ratio and the a priori signal-to-noise ratio of the next frame audio signal comprises:
obtaining the amplitude spectrum gain of the audio signal of the next frame by using the following formula:
G(k, λ+1) = (ξ(k, λ+1)/(1 + ξ(k, λ+1)))·exp((1/2)∫_{v(k, λ+1)}^{∞}(e^{-t}/t)dt),
where the λ+1 frame represents the next frame audio signal, G(k, λ+1) represents the amplitude spectral gain of the k-th sub-band of the λ+1 frame, ξ(k, λ+1) represents the a priori signal-to-noise ratio of the k-th sub-band of the λ+1 frame,
wherein v(k, λ+1) = γ(k, λ+1)·ξ(k, λ+1)/(1 + ξ(k, λ+1)),
and γ(k, λ+1) represents the a posteriori signal-to-noise ratio of the k-th sub-band of the λ+1 frame.
13. A speech enhancement apparatus, characterized in that the apparatus comprises:
the calculating module is used for calculating the voice existence probability of the current frame audio signal, and the voice existence probability represents the existence probability of the voice signal in the current frame audio signal;
the obtaining module is connected with the calculating module and used for obtaining the noise variance of the audio signal of the next frame by utilizing the voice existence probability;
and the voice enhancement module is connected with the obtaining module and used for carrying out voice enhancement on the next frame of audio signal by utilizing the noise variance of the next frame of audio signal.
14. The apparatus of claim 13, wherein the computing module comprises:
the first calculation submodule is used for calculating the voice nonexistence probability of the current frame audio signal, and the voice nonexistence probability is the nonexistence probability of the voice signal in the current frame audio signal;
the second calculation submodule is used for calculating the prior signal-to-noise ratio of the current frame audio signal;
and the third calculation submodule is used for calculating the voice existence probability of the current frame audio signal by utilizing the voice nonexistence probability of the current frame audio signal and the prior signal-to-noise ratio.
15. The apparatus of claim 14, wherein the calculating the speech absence probability of the current frame audio signal comprises:
calculating the minimum value of the power spectrum of the current frame audio signal according to the power spectrum of the current frame audio signal;
and calculating the voice non-existence probability of the current frame audio signal by using the power spectrum minimum value.
16. The apparatus of claim 15, wherein the calculating the minimum value of the power spectrum of the current frame audio signal according to the power spectrum of the current frame audio signal comprises:
acquiring the minimum value of the power spectrum of the current frame audio signal by using the following formula:
Smin(k, λ) = α1·Smin(k, λ-1) + β·(S(k, λ) − α2·S(k, λ-1)) if Smin(k, λ-1) < S(k, λ), and Smin(k, λ) = S(k, λ) otherwise,
wherein Smin(k, λ) represents the minimum value of the power spectrum of the k-th subband of the current λ frame, Smin(k, λ-1) represents the power spectrum minimum of the k-th subband of the λ-1 frame, S(k, λ) represents the power spectrum of the k-th subband of the current λ frame, S(k, λ-1) represents the power spectrum of the k-th subband of the λ-1 frame, and α1, α2 and β are preset parameters.
17. The apparatus according to claim 15 or 16, wherein said calculating the speech absence probability of the current frame audio signal using the power spectrum minimum value comprises:
obtaining a first ratio of the current frame audio signal by using the following formula:
γmin(k, λ) = Y²(k, λ)/(B·Smin(k, λ)),
wherein, γmin(k, λ) represents a first ratio of the kth subband of the current λ frame, Y (k, λ) represents a magnitude spectrum of the kth subband of the current λ frame, Smin(k, λ) represents the power spectrum minimum value of the kth sub-band of the current λ frame, B represents a bias compensation parameter;
obtaining a second ratio of the current frame audio signal according to the following formula:
η(k, λ) = S(k, λ)/(B·Smin(k, λ)),
wherein η (k, λ) represents a second ratio of the kth subband of the current λ frame, and S (k, λ) represents a power spectrum of the kth subband of the current λ frame;
and calculating the voice non-existence probability of the current frame signal according to the first ratio and the second ratio.
18. The apparatus of claim 17, wherein the calculating the speech non-existence probability of the current frame signal according to the first ratio and the second ratio comprises:
determining the voice non-existence probability by the following formula under the condition that the first ratio is less than or equal to 1 and the second ratio is less than a first preset threshold value:
q(k, λ) = 1, where q(k, λ) represents the probability of speech absence for the k-th subband of the current λ frame; or
Determining the voice non-existence probability through the following formula under the condition that the first ratio is greater than 1 and less than or equal to a second preset threshold and the second ratio is less than the first preset threshold:
q(k, λ) = (γ1 − γmin(k, λ))/(γ1 − 1),
wherein γ1 represents said second preset threshold value, and γmin(k, λ) represents the first ratio; or
Determining the voice non-existence probability through the following formula under the condition that the first ratio is greater than or equal to the second preset threshold or the second ratio is greater than or equal to the first preset threshold:
q(k, λ) = 0.
19. the apparatus of claim 14, wherein said calculating the prior signal-to-noise ratio of the current frame audio signal comprises:
calculating the posterior signal-to-noise ratio of the current frame audio signal;
calculating the prior signal-to-noise ratio of the current frame audio signal by using the following formula:
ξ(k, λ) = ε·G(k, λ-1)·γ(k, λ-1) + (1 − ε)·max{γ(k, λ) − 1, 0}, where ξ(k, λ) represents the a priori signal-to-noise ratio of the k-th subband of the current λ frame, G(k, λ-1) represents the amplitude spectral gain of the k-th subband of the λ-1 frame, γ(k, λ-1) represents the a posteriori signal-to-noise ratio of the k-th subband of the λ-1 frame, γ(k, λ) represents the a posteriori signal-to-noise ratio of the k-th subband of the current λ frame, and ε is a constant.
20. The apparatus according to claim 19, wherein said calculating the a posteriori signal-to-noise ratio of the current frame audio signal comprises:
the a posteriori signal to noise ratio is calculated using the formula:
γ(k, λ) = Y²(k, λ)/σd²(k, λ),
wherein Y(k, λ) represents the amplitude spectrum of the k-th sub-band of the current λ frame, and σd²(k, λ) represents the noise variance of the k-th sub-band of the current λ frame.
21. The apparatus of claim 14, wherein the calculating the speech existence probability of the current frame audio signal using the speech absence probability of the current frame audio signal and the a priori signal-to-noise ratio comprises:
calculating the speech existence probability of the current frame audio signal by using the following formula:
p(k, λ) = {1 + [q(k, λ)/(1 − q(k, λ))]·(1 + ξ(k, λ))·exp(−v(k, λ))}^(-1),
where p(k, λ) represents the speech presence probability of the k-th subband of the current λ frame, q(k, λ) represents the speech absence probability of the k-th subband of the current λ frame, ξ(k, λ) represents the a priori signal-to-noise ratio of the k-th subband of the current λ frame,
wherein v(k, λ) = γ(k, λ)·ξ(k, λ)/(1 + ξ(k, λ)),
and γ(k, λ) represents the a posteriori signal-to-noise ratio of the k-th subband of the current λ frame.
22. The apparatus according to claim 13, wherein the obtaining the noise variance of the audio signal of the next frame by using the speech existence probability comprises:
the noise variance of the audio signal of the next frame is calculated using the following formula:
σd²(k, λ+1) = αD(k, λ)·σd²(k, λ) + (1 − αD(k, λ))·Y²(k, λ);
αD(k, λ) = αd + (1 − αd)·p(k, λ);
wherein σd²(k, λ+1) represents the noise variance of the k-th subband of the λ+1 frame, σd²(k, λ) represents the noise variance of the k-th subband of the current λ frame, Y(k, λ) represents the magnitude spectrum of the k-th subband of the current λ frame, p(k, λ) represents the probability of speech presence of the k-th subband of the current λ frame, and αd is a constant.
23. The apparatus of claim 13, wherein the speech enhancement module comprises:
an a posteriori signal-to-noise ratio obtaining sub-module, configured to obtain the a posteriori signal-to-noise ratio of the next frame audio signal using the noise variance of the next frame audio signal;
an a priori signal-to-noise ratio obtaining sub-module, configured to obtain the a priori signal-to-noise ratio of the next frame audio signal using the a posteriori signal-to-noise ratio of the next frame audio signal;
an amplitude spectrum gain obtaining sub-module, configured to obtain the amplitude spectrum gain of the next frame audio signal using the a posteriori signal-to-noise ratio and the a priori signal-to-noise ratio of the next frame audio signal; and
a speech enhancement sub-module, configured to perform speech enhancement on the next frame audio signal using the amplitude spectrum gain of the next frame audio signal.
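The four sub-modules of claim 23 chain together per sub-band. A sketch of that chain under two stated assumptions: the a priori SNR uses decision-directed smoothing (the claim only says it is derived from the a posteriori SNR), and a simple Wiener gain stands in for the claim-24 gain; β and all names are illustrative:

```python
def enhance_subband(Y_mag, sigma2_d, G_prev, Y_prev_mag, beta=0.98):
    """One sub-band of the claim-23 pipeline: posterior SNR from the
    noise variance, a priori SNR (decision-directed, an assumption),
    spectral gain (Wiener stand-in), then the enhanced amplitude."""
    gamma = Y_mag ** 2 / sigma2_d                      # a posteriori SNR
    xi = beta * (G_prev * Y_prev_mag) ** 2 / sigma2_d \
        + (1.0 - beta) * max(gamma - 1.0, 0.0)         # a priori SNR
    G = xi / (1.0 + xi)                                # amplitude gain
    return G * Y_mag                                   # enhanced amplitude
```

A noise-level observation (SNR near 1) is attenuated by roughly half, while a strongly voiced sub-band passes nearly unchanged.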
24. The apparatus according to claim 23, wherein the obtaining the amplitude spectrum gain of the next frame audio signal using the a posteriori signal-to-noise ratio and the a priori signal-to-noise ratio of the next frame audio signal comprises:
obtaining the amplitude spectrum gain of the next frame audio signal using the following formula:
G(k,λ+1) = [ξ(k,λ+1)/(1 + ξ(k,λ+1))]·exp{(1/2)·∫_{v(k,λ+1)}^{∞} (e^(−t)/t) dt};
where the λ+1 frame represents the next frame audio signal, G(k,λ+1) represents the amplitude spectrum gain of the k-th sub-band of the λ+1 frame, and ξ(k,λ+1) represents the a priori signal-to-noise ratio of the k-th sub-band of the λ+1 frame,
wherein
v(k,λ+1) = γ(k,λ+1)·ξ(k,λ+1)/(1 + ξ(k,λ+1));
and γ(k,λ+1) represents the a posteriori signal-to-noise ratio of the k-th sub-band of the λ+1 frame.
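Assuming the claim-24 gain takes the classical MMSE-LSA form G = [ξ/(1+ξ)]·exp{(1/2)∫_v^∞ (e^(−t)/t) dt} with v = γξ/(1+ξ) (an assumption consistent with the symbols the claim defines), it can be evaluated with a plain numerical integral:

```python
import math

def lsa_gain(xi, gamma, steps=2000):
    """Amplitude spectrum gain G(k, lam+1) from the a priori SNR xi and
    a posteriori SNR gamma, assuming the MMSE-LSA form; requires xi > 0.
    The exponential integral E1(v) = int_v^inf e^(-t)/t dt is computed
    by the composite trapezoidal rule; the integrand is negligible
    beyond t = v + 40, so the upper limit is truncated there."""
    v = gamma * xi / (1.0 + xi)
    a, b = v, v + 40.0
    h = (b - a) / steps
    e1 = 0.5 * h * (math.exp(-a) / a + math.exp(-b) / b)
    for i in range(1, steps):
        t = a + i * h
        e1 += h * math.exp(-t) / t
    return xi / (1.0 + xi) * math.exp(0.5 * e1)
```

For large v the integral vanishes and the gain reduces to the Wiener value ξ/(1+ξ); for small v the exponential factor boosts the gain above it.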
25. A speech enhancement apparatus, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
performing the method of any one of claims 1-12.
26. A non-transitory computer readable storage medium having stored thereon computer program instructions, wherein the computer program instructions, when executed by a processor, implement the method of any one of claims 1 to 12.
CN201910388459.6A 2019-05-10 2019-05-10 Voice enhancement method and device Active CN111292761B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910388459.6A CN111292761B (en) 2019-05-10 2019-05-10 Voice enhancement method and device


Publications (2)

Publication Number Publication Date
CN111292761A true CN111292761A (en) 2020-06-16
CN111292761B CN111292761B (en) 2023-04-14

Family

ID=71028306

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910388459.6A Active CN111292761B (en) 2019-05-10 2019-05-10 Voice enhancement method and device

Country Status (1)

Country Link
CN (1) CN111292761B (en)


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101853666A (en) * 2009-03-30 2010-10-06 华为技术有限公司 Speech enhancement method and device
CN103903629A (en) * 2012-12-28 2014-07-02 联芯科技有限公司 Noise estimation method and device based on hidden Markov model


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Jiang Jianzhong et al.: "A New Speech Enhancement Algorithm for Strong-Noise Environments", Computer Engineering and Applications *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111986691A (en) * 2020-09-04 2020-11-24 腾讯科技(深圳)有限公司 Audio processing method and device, computer equipment and storage medium
CN111986691B (en) * 2020-09-04 2024-02-02 腾讯科技(深圳)有限公司 Audio processing method, device, computer equipment and storage medium
CN113973250A (en) * 2021-10-26 2022-01-25 恒玄科技(上海)股份有限公司 Noise suppression method and device and auxiliary listening earphone
CN113973250B (en) * 2021-10-26 2023-12-08 恒玄科技(上海)股份有限公司 Noise suppression method and device and hearing-aid earphone


Similar Documents

Publication Publication Date Title
CN109361828B (en) Echo cancellation method and device, electronic equipment and storage medium
CN111128221B (en) Audio signal processing method and device, terminal and storage medium
CN111883164B (en) Model training method and device, electronic equipment and storage medium
CN111968662A (en) Audio signal processing method and device and storage medium
CN110931028B (en) Voice processing method and device and electronic equipment
CN111292761B (en) Voice enhancement method and device
CN111986693A (en) Audio signal processing method and device, terminal equipment and storage medium
CN109256145B (en) Terminal-based audio processing method and device, terminal and readable storage medium
CN110675355B (en) Image reconstruction method and device, electronic equipment and storage medium
CN111583142A (en) Image noise reduction method and device, electronic equipment and storage medium
CN112201267A (en) Audio processing method and device, electronic equipment and storage medium
CN109068138B (en) Video image processing method and device, electronic equipment and storage medium
CN111667842B (en) Audio signal processing method and device
CN113077808B (en) Voice processing method and device for voice processing
CN112651880B (en) Video data processing method and device, electronic equipment and storage medium
CN111294473B (en) Signal processing method and device
CN113223553B (en) Method, apparatus and medium for separating voice signal
CN107481734B (en) Voice quality evaluation method and device
CN110580910A (en) Audio processing method, device and equipment and readable storage medium
CN111583145B (en) Image noise reduction method and device, electronic equipment and storage medium
CN113489854B (en) Sound processing method, device, electronic equipment and storage medium
CN113345456B (en) Echo separation method, device and storage medium
CN113113036B (en) Audio signal processing method and device, terminal and storage medium
CN111292760B (en) Sounding state detection method and user equipment
CN111292250B (en) Image processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant