CN114360560A - Speech enhancement post-processing method and device based on harmonic structure prediction - Google Patents

Speech enhancement post-processing method and device based on harmonic structure prediction Download PDF

Info

Publication number
CN114360560A
CN114360560A CN202210049231.6A CN202210049231A CN114360560A CN 114360560 A CN114360560 A CN 114360560A CN 202210049231 A CN202210049231 A CN 202210049231A CN 114360560 A CN114360560 A CN 114360560A
Authority
CN
China
Prior art keywords
time
estimation
spectral density
power spectral
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210049231.6A
Other languages
Chinese (zh)
Inventor
何平
蒋升
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suirui Technology Group Co Ltd
Original Assignee
Suirui Technology Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suirui Technology Group Co Ltd filed Critical Suirui Technology Group Co Ltd
Priority to CN202210049231.6A priority Critical patent/CN114360560A/en
Publication of CN114360560A publication Critical patent/CN114360560A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0216Noise filtering characterised by the method used for estimating noise
    • G10L21/0224Processing in the time domain
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0216Noise filtering characterised by the method used for estimating noise
    • G10L21/0232Processing in the frequency domain
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)

Abstract

The invention discloses a speech enhancement post-processing method and a speech enhancement post-processing device based on harmonic structure prediction, which belong to the field of information processing, and comprise the following steps: s1: carrying out short-time Fourier transform on a voice signal of a microphone to obtain a time-frequency domain expression; s2: carrying out harmonic loss estimation and correction on the time-frequency domain signal to obtain estimated power spectral density; s3: estimating a time-frequency masking value according to the power spectral density; s4: and according to the estimated time-frequency masking value, acquiring frequency domain estimation of the target voice, and further acquiring time domain estimation of the target voice. The invention can predict the lost harmonic structure to a certain extent, the recovered voice is more in line with the characteristics of near-speaking voice, and the intelligibility and the voice perception quality are higher.

Description

Speech enhancement post-processing method and device based on harmonic structure prediction
Technical Field
The invention belongs to the field of information processing, and particularly relates to a speech enhancement post-processing method and device based on harmonic structure prediction.
Background
Background noise can degrade the communication quality of a speech system in many applications, such as voice conferencing systems. The suppression of the noise signal collected by the microphone by the algorithm is one of the key technologies necessary for the conference system related application. However, the noise suppression method suppresses noise and also damages the voice signal. Therefore, it is also necessary to consider how to enhance the speech signal, especially the harmonic structure of speech, while suppressing noise.
In the prior art, noise suppression and voice enhancement are key technologies of voice communication quality in a conference system or conference equipment. The conventional signal processing method is to track the noise power spectral density and the voice power spectral density in the signal and then construct a masking value of 0 to 1 in the frequency domain based on wiener filtering. After the original signal is masked, the purpose of inhibiting background noise is achieved. In order to overcome the defect that the traditional signal processing method is ineffective to non-stationary noise, the technology for performing time-frequency masking estimation based on deep learning is more and more mature and is more and more applied. The main idea is to estimate the time-frequency masking value directly from the mixed signal by training the noisy data set to the clean speech signal. At present, the effect of noise suppression based on deep learning is superior to that of the traditional signal processing method. Both traditional signal processing methods and deep learning methods present a potential risk of speech distortion. Because the energy of the voice signal is mainly distributed on the harmonic structure, how to enhance the voice signal through the prediction of the harmonic structure has great significance for improving the voice communication quality.
At present, the main disadvantages of the methods for estimating the time-frequency masking value in the prior art are as follows: 1. the existing time-frequency masking method is easy to ignore the harmonic structure of voice, so that partial harmonic is lost and the communication quality is influenced; 2. under a far-distance speaking scene, high-frequency harmonics can be weakened due to the influence of reverberation on voice, and the weakened high-frequency harmonics cannot be recovered by the conventional time-frequency masking method.
The information disclosed in this background section is only for enhancement of understanding of the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.
Disclosure of Invention
The invention aims to provide a speech enhancement post-processing method and a speech enhancement post-processing device based on harmonic structure prediction, which can predict a lost harmonic structure to a certain extent, and the recovered speech is more consistent with the characteristics of near-speaking speech and has higher intelligibility and speech perception quality.
In order to achieve the above object, the present invention provides a speech enhancement post-processing method based on harmonic structure prediction, which includes the following steps:
s1: carrying out short-time Fourier transform on a voice signal of a microphone to obtain a time-frequency domain expression;
s2: carrying out harmonic loss estimation and correction on the time-frequency domain signal to obtain estimated power spectral density;
s3: estimating a time-frequency masking value according to the power spectral density;
s4: and according to the estimated time-frequency masking value, acquiring frequency domain estimation of the target voice, and further acquiring time domain estimation of the target voice.
In an embodiment of the present invention, the step S1 is preceded by: acquiring a voice signal x (n) of a microphone;
the short-time fourier transform of the speech signal x (n) of the microphone in step S1 is as follows:
Figure BDA0003473903570000021
in an embodiment of the present invention, the step S2 specifically includes the following steps:
s201: calculating the masked time domain signal by adopting the time frequency masking value M (l, k) estimated by deep learning
Figure BDA0003473903570000022
The specific calculation process is as follows:
Figure BDA0003473903570000023
s202: for the masked time domain signal
Figure BDA0003473903570000024
Performing half-wave rectificationAfter the flow, a Fourier transform is performed to obtain an estimate of the harmonic loss
Figure BDA0003473903570000025
The specific calculation formula is as follows:
Figure BDA0003473903570000031
Figure BDA0003473903570000032
s203: for the harmonic loss estimation correction, the correction process is as follows:
Figure BDA0003473903570000033
s204: for each frequency band k, estimating the power spectral density by adopting a uniform smoothing factor alpha; wherein, the power spectral density comprises the power spectral density of background noise, the power spectral density of time-frequency masked voice and the power spectral density of harmonic loss;
the power spectral density estimation process is as follows:
ρv(k)=αρv(k)+(1-α)(1-M(l,k))|X(l,k)|2
ρs(k)=αρs(k)+(1-α)M(l,k)|X(l,k)|2
Figure BDA0003473903570000034
in an embodiment of the present invention, in the step S3, for each frequency band k, a time-frequency masking value G (l, k) is estimated, and the estimation process is as follows:
Figure BDA0003473903570000035
in one embodiment of the present invention, in the step S4, the frequency domain estimation of the target speech
Figure BDA0003473903570000036
The acquisition process is as follows:
y^(″l,k″)=G(l,k)X(l,k),
and then, obtaining a target voice time domain estimation through inverse Fourier transform, wherein the process is as follows:
Figure BDA0003473903570000037
the invention also provides a speech enhancement post-processing device based on harmonic structure prediction, which comprises a signal decomposition module, a loss estimation module, a masking calculation module and a speech synthesis module, wherein the signal decomposition module comprises a harmonic structure prediction module, a harmonic structure prediction module and a harmonic structure prediction module, and the harmonic structure prediction module comprises a harmonic structure prediction module, a harmonic structure prediction module and a harmonic structure prediction module, wherein the harmonic structure prediction module comprises a harmonic structure prediction module, a harmonic structure prediction module and a harmonic structure prediction module, and the harmonic structure prediction module comprises a harmonic structure prediction module, a harmonic structure prediction module and a harmonic structure prediction module, wherein the harmonic structure prediction module comprises a harmonic structure:
the signal decomposition module is used for carrying out short-time Fourier transform on the voice signal of the microphone to obtain time-frequency domain expression;
the loss estimation module is used for carrying out harmonic loss estimation and correction on the time-frequency domain signal to obtain estimated power spectral density; the harmonic loss compensation method comprises a masking calculation module, a harmonic loss calculation module, a loss estimation correction module and a power spectral density calculation module;
the masking calculation module is used for estimating a time-frequency masking value according to the power spectral density;
and the voice synthesis module is used for acquiring the frequency domain estimation of the target voice according to the estimated time-frequency masking value so as to obtain the time domain estimation of the target voice.
In an embodiment of the present invention, the signal decomposition module is further configured to obtain a speech signal x (n) of a microphone;
in the signal decomposition module, the process of performing short-time fourier transform on the speech signal x (n) of the microphone is as follows:
Figure BDA0003473903570000041
in one embodiment of the present inventionThe masking calculation module is configured to calculate a masked time-domain signal by using a time-frequency masking value M (l, k) estimated by deep learning
Figure BDA0003473903570000042
The specific calculation process is as follows:
Figure BDA0003473903570000043
the harmonic loss calculation module is used for calculating the masked time domain signal
Figure BDA0003473903570000044
After half-wave rectification, Fourier transform is carried out to obtain estimation of harmonic loss
Figure BDA0003473903570000045
The specific calculation formula is as follows:
Figure BDA0003473903570000046
Figure BDA0003473903570000047
the loss estimation and correction module is used for carrying out harmonic loss estimation and correction, and the correction process is as follows:
Figure BDA0003473903570000048
the power spectral density calculating module is used for estimating the power spectral density of each frequency band k by adopting a unified smoothing factor alpha; wherein the power spectral density comprises the power spectral density of background noise, the power spectral density of time-frequency masked voice and the power spectral density of harmonic loss,
the power spectral density estimation process is as follows:
ρv(k)=αρv(k)+(1-α)(1-M(l,k))|X(l,k)|2
ρs(k)=αρs(k)+(1-α)M(l,k)|X(l,k)|2
Figure BDA0003473903570000051
in an embodiment of the present invention, the masking calculation module estimates a time-frequency masking value G (l, k) for each frequency band k, and the estimation process is as follows:
Figure BDA0003473903570000052
in an embodiment of the present invention, in the speech synthesis module, the frequency domain estimation of the target speech
Figure BDA0003473903570000053
The acquisition process is as follows:
y^(″l,k″)=G(l,k)X(l,k),
and obtaining a target voice time domain estimation through inverse Fourier transform, wherein the process is as follows:
Figure BDA0003473903570000054
the invention provides a speech enhancement post-processing method and a speech enhancement post-processing device based on harmonic structure prediction, which have the following beneficial effects:
1. according to the method and the device, the lost harmonic component of the time-frequency masking information obtained based on deep learning is estimated, the more lost harmonic component is obtained, new time-frequency masking information is constructed, the harmonic component can be effectively recovered, and the voice perception quality is improved.
2. The invention adopts novel time-frequency masking estimation, considers the specific requirements of voice communication, considers noise suppression, recovers the lost harmonic waves to a certain degree and has better communication quality.
Drawings
Fig. 1 is a flowchart of a speech enhancement post-processing method based on harmonic structure prediction in the present embodiment.
Fig. 2 is a diagram of a hamming window function used in this embodiment.
Fig. 3 is a schematic diagram of a speech enhancement post-processing device based on harmonic structure prediction according to the present embodiment.
Detailed Description
The present invention will be described in further detail with reference to specific embodiments in order to make the technical field better understand the scheme of the present invention.
As shown in fig. 1, an embodiment of the present invention is a speech enhancement post-processing method based on harmonic structure prediction.
The method specifically comprises the following four implementation steps:
s1: and carrying out short-time Fourier transform on the voice signal of the microphone to obtain a time-frequency domain expression.
The voice signal of the microphone is a digital signal obtained after sound pressure collected by the microphone passes through the ADC.
Before step S1, the method further includes acquiring a voice signal of the microphone, where the acquired voice signal is as follows: let x (n) represent the original time domain signal picked up by the microphone element in real time, where n represents the time tag.
Specifically, the process of performing short-time fourier transform on the speech signal x (n) of the microphone to obtain the time-frequency domain expression is as follows:
Figure BDA0003473903570000061
wherein, N is the frame length, and N is 512; w (n) is a Hamming window of length 512, where n represents a time stamp, i.e., a time sequence number, and thus w (n) represents the value at each corresponding time sequence number n; 1 is a time frame sequence number, and takes a frame as a unit; k is a frequency band number, wherein a frequency band refers to a signal component corresponding to a certain frequency; j representsUnit of imaginary number
Figure BDA0003473903570000062
X (l, k) is the speech signal of the mth microphone, and in the mth frame, the frequency spectrum of the kth frequency band.
The hamming window function used in the present invention is shown in fig. 2.
Through the above step S1, the time domain signal of the voice signal of the microphone can be converted into a time-frequency domain signal due to the processing of the subsequent steps.
S2: and performing harmonic loss estimation and correction on the time-frequency domain signal to obtain estimated power spectral density.
Specifically, the time-frequency masking value of deep learning estimation is adopted to calculate the masked time-domain signal, and the harmonic loss is estimated and corrected, so that the power spectral density is estimated.
Specifically, the present step S2 includes the steps of:
s201: calculating the masked time domain signal by adopting the time frequency masking value M (l, k) estimated by deep learning
Figure BDA0003473903570000071
The specific calculation process is as follows:
Figure BDA0003473903570000072
the time-frequency masking M (l, k) is a common scheme for speech enhancement based on deep learning.
The time domain signal output in step S201 has a certain harmonic loss, that is, energy in a part of frequency band of the periodic harmonic frequency band (fundamental frequency and its multiple frequency) is significantly attenuated, but most of the harmonic structure is preserved, which can be used for calculating the lost harmonic structure in the subsequent step.
Unlike step S1, step S201 does not require a windowing function in the present inverse fourier transform.
S202: for the masked time domain signal
Figure BDA0003473903570000073
After half-wave rectification, Fourier transform is carried out to obtain estimation of harmonic loss
Figure BDA0003473903570000074
The specific calculation formula is as follows:
Figure BDA0003473903570000075
Figure BDA0003473903570000076
where sign () represents a half-wave rectification operation.
In step S202, smoothing of adjacent harmonic frequency bands can be achieved by half-wave rectification and then fourier transform, and damaged harmonics can be partially recovered. In the estimation of the harmonic loss obtained
Figure BDA0003473903570000077
If there is a certain under-estimation, correction will be performed in the subsequent step S203.
S203: carrying out harmonic loss estimation correction, wherein the correction process is as follows:
Figure BDA0003473903570000081
the harmonic loss estimated in the previous step S202 has a certain error, and the result of the estimation can be corrected in this step S203.
The principle of the correction method adopted in step S203 is that a time-frequency masking method is adopted, and the masking value M (l, k) is always smaller than 1. Therefore, if estimation of harmonic loss
Figure BDA0003473903570000082
Is less than its original value Y (l, k), indicating that the harmonics are not recovered, and the original estimate is still used as a maskThe mask value. The result of this step S203 can be used to update the power spectral density and to update the time-frequency mask estimate in subsequent steps.
S204: and estimating the power spectral density by adopting a uniform smoothing factor alpha for each frequency band k, wherein the power spectral density comprises the power spectral density of background noise, the power spectral density of time-frequency masked voice and the power spectral density of harmonic loss.
The power spectral density estimation process is as follows:
ρv(k)=αρv(k)+(1-α)(1-M(l,k))|X(l,k)|2
ρs(k)=αρs(k)+(1-α)M(l,k)|X(l,k)|2
Figure BDA0003473903570000083
where ρ isv(k)、ρs(k) And ρh(k) Respectively representing the power spectral density of background noise, voice after time-frequency masking and harmonic loss. Rhoh(k) It can be understood as the average result of harmonic loss over a period of time, hence using ph(k) Harmonic recovery is performed with better robustness. Alpha is a smoothing factor between adjacent frames, the value range is between 0 and 1, if the value is too small, the power spectral density estimation change amplitude is too large, and the defect of instability exists, and if the value is too high, the energy estimation is too stable, and the modeling capability of a non-stable signal is reduced. The invention preferably has alpha of 0.95, and can balance stability and capability of modeling non-stationary noise.
The output structure of this step S204 can be used in subsequent steps to construct masking values that can restore the harmonic structure.
S3: and estimating a time-frequency masking value according to the power spectral density.
Specifically, for each frequency band k, a time-frequency masking value G (l, k) is estimated, and the estimation process is as follows:
Figure BDA0003473903570000091
where max () represents taking the larger of the two, in this step S3,
Figure BDA0003473903570000092
the harmonic structure is enhanced, and the calculated time-frequency masking value is used for obtaining the voice frequency spectrum estimation in the subsequent step.
S4: and according to the estimated time-frequency masking value, acquiring frequency domain estimation of the target voice, and further acquiring time domain estimation of the target voice.
Wherein the frequency domain estimation of the target speech
Figure BDA0003473903570000093
The acquisition process is as follows:
y^(″l,k″)=G(l,k)X(l,k)
and then, obtaining a target voice time domain estimation through inverse Fourier transform, wherein the process is as follows:
Figure BDA0003473903570000094
through the step S4, the time domain estimated signal can be directly converted into a voltage signal through digital-to-analog conversion, and the enhanced voice signal is played by a speaker.
Through the steps S1-S4, signal time-frequency decomposition, harmonic loss estimation, time-frequency masking calculation and target voice synthesis can be realized, and finally, the voice communication quality is improved.
As shown in fig. 3, an embodiment of the present invention is a speech enhancement post-processing apparatus based on harmonic structure prediction, and includes a signal decomposition module 1, a loss estimation module 2, a masking calculation module 3, and a speech synthesis module 4.
And the signal decomposition module 1 is used for carrying out short-time Fourier transform on the voice signal of the microphone to obtain a time-frequency domain expression.
The signal decomposition module 1 can also be used to obtain a speech signal of the microphone and an echo reference signal, where the obtained speech signal is as follows: let x (n) represent the original time domain signal picked up by the microphone element in real time, where n represents the time tag.
Specifically, the method of performing the short-time fourier transform is as follows:
carrying out short-time Fourier transform on the time domain signal x (n) to obtain a time-frequency domain expression:
Figure BDA0003473903570000101
wherein, N is the frame length, and N is 512; w (n) is a Hamming window of length 512, where n represents a time stamp, i.e., a time sequence number, and thus w (n) represents the value at each corresponding time sequence number n; 1 is a time frame sequence number, and takes a frame as a unit; k is a frequency band number, wherein a frequency band refers to a signal component corresponding to a certain frequency; j represents an imaginary unit
Figure BDA0003473903570000102
X (l, k) is the speech signal of the mth microphone, and in the 1 st frame, the frequency spectrum of the kth frequency band.
The hamming window function used in the present invention is shown in fig. 2.
The time domain signal of the voice signal of the microphone can be converted into a time-frequency domain signal by the signal decomposition module 1.
And the loss estimation module 2 is configured to perform harmonic loss estimation on the time-frequency domain signal, estimate a loss harmonic component in the time-frequency domain signal, and obtain an estimated power spectral density. The loss estimation module 2 comprises a masking calculation module, a harmonic loss calculation module, a loss estimation correction module and a power spectral density calculation module.
Specifically, the masking calculation module is configured to calculate a masked time domain signal with a time-frequency masking value M (l, k) estimated by deep learning
Figure BDA0003473903570000103
The specific calculation process is as follows:
Figure BDA0003473903570000104
the time domain signal output by the masking calculation module has a certain harmonic loss, namely, the energy on a part of frequency bands (fundamental frequency and frequency multiplication) of the periodic harmonic frequency band is obviously weakened, but most of harmonic structures are reserved and can be used for calculating the lost harmonic structures subsequently.
The masking computation module differs from the signal decomposition module 1 in that the inverse fourier transform does not require a windowing function.
Harmonic loss calculation module for masking time domain signal
Figure BDA0003473903570000105
After half-wave rectification, Fourier transform is carried out to obtain estimation of harmonic loss
Figure BDA0003473903570000106
The specific calculation process is as follows:
Figure BDA0003473903570000107
Figure BDA0003473903570000108
where sign () represents a half-wave rectification operation.
The harmonic loss calculation module can realize the smoothing of adjacent harmonic frequency bands by a mode of half-wave rectification and then Fourier transform, and can partially recover damaged harmonics. In the obtained estimation of the harmonic loss, there is a certain under-estimation, and the correction will be performed subsequently.
The loss estimation and correction module is used for carrying out harmonic loss estimation and correction, and the correction process is as follows:
Figure BDA0003473903570000111
the harmonic loss estimated in the harmonic loss calculation module has certain error, and the loss estimation correction module can correct the estimated result.
The principle of the correction method adopted by the loss estimation correction module is that a time-frequency masking method is adopted, and the masking value M (l, k) is always smaller than 1. Therefore, if estimation of harmonic loss
Figure BDA0003473903570000112
The value of (c) is less than its original value X (l, k), indicating that the harmonics are not recovered, and the original estimate is still used as the masking value. The results of the loss estimation correction module can be used to subsequently update the power spectral density and update the time-frequency mask estimate.
And the power spectral density calculation module is used for estimating the power spectral density by adopting a uniform smoothing factor alpha for each frequency band k, wherein the power spectral density comprises the power spectral density of background noise, the power spectral density of the time-frequency masked voice and the power spectral density of harmonic loss.
The power spectral density estimation process is as follows:
ρv(k)=αρv(k)+(1-α)(1-M(l,k))|X(l,k)|2
ρs(k)=αρs(k)+(1-α)M(l,k)|X(l,k)|2
Figure BDA0003473903570000113
where ρ isv(k)、ρs(k) And ρh(k) Respectively representing the power spectral density of background noise, voice after time-frequency masking and harmonic loss. Rhoh(k) It can be understood as the average result of harmonic loss over a period of time, hence using ph(k) Harmonic recovery is performed with better robustness. Alpha is a smoothing factor between adjacent frames and has a value ranging between 0 and 1. The invention preferably selects a to 0.95, if the value is too small, the power spectral density estimation will change too much, and there is a defect of instability, if the value is too high,the energy estimate is too stationary and the ability to model non-stationary signals is reduced.
And the separation matrix calculation module 3 is used for estimating the time-frequency masking value according to the power spectral density.
Specifically, for each frequency band k, the time-frequency masking value G (l, k) is estimated, and the estimation process is as follows:
Figure BDA0003473903570000121
where max () represents taking the large of the two. In the separation matrix calculation block 3, the separation matrix,
Figure BDA0003473903570000122
the harmonic structure is enhanced, and the calculated time-frequency masking value is used for obtaining the voice frequency spectrum estimation in the subsequent step.
And the voice synthesis module 4 is configured to obtain a frequency domain estimation of the target voice according to the estimated time-frequency masking value, and further obtain a time domain estimation of the target voice.
Wherein the frequency domain estimation of the target speech
Figure BDA0003473903570000123
The acquisition process is as follows:
y^(″l,k″)=G(l,k)X(l,k)
and then, obtaining a target voice time domain estimation through inverse Fourier transform, wherein the process is as follows:
Figure BDA0003473903570000124
through the speech synthesis module 4, the time domain estimated signal can be directly converted into a voltage signal through digital-to-analog conversion, and the enhanced speech signal is played by a loudspeaker.
The speech enhancement post-processing device based on harmonic structure prediction can realize signal time-frequency decomposition, harmonic loss estimation, time-frequency masking calculation and target speech synthesis, and finally improve the speech communication quality.
The foregoing descriptions of specific exemplary embodiments of the present invention have been presented for purposes of illustration and description. It is not intended to limit the invention to the precise form disclosed, and obviously many modifications and variations are possible in light of the above teaching. The exemplary embodiments were chosen and described in order to explain certain principles of the invention and its practical application to enable one skilled in the art to make and use various exemplary embodiments of the invention and various alternatives and modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims and their equivalents.

Claims (10)

1. A speech enhancement post-processing method based on harmonic structure prediction is characterized by comprising the following steps:
s1: carrying out short-time Fourier transform on a voice signal of a microphone to obtain a time-frequency domain expression;
s2: carrying out harmonic loss estimation and correction on the time-frequency domain signal to obtain estimated power spectral density;
s3: estimating a time-frequency masking value according to the power spectral density;
s4: and according to the estimated time-frequency masking value, acquiring frequency domain estimation of the target voice, and further acquiring time domain estimation of the target voice.
2. The method for speech enhancement post-processing based on harmonic structure prediction according to claim 1, wherein said step S1 is preceded by the steps of: acquiring a voice signal x (n) of a microphone;
the short-time fourier transform of the speech signal x (n) of the microphone in step S1 is as follows:
Figure FDA0003473903560000011
3. the harmonic structure prediction-based speech enhancement post-processing method according to claim 2, wherein the step S2 specifically comprises the following steps:
s201: calculating the masked time domain signal by adopting the time frequency masking value M (l, k) estimated by deep learning
Figure FDA0003473903560000012
The specific calculation process is as follows:
Figure FDA0003473903560000013
s202: for the masked time domain signal
Figure FDA0003473903560000014
After half-wave rectification, Fourier transform is carried out to obtain estimation of harmonic loss
Figure FDA0003473903560000015
The specific calculation formula is as follows:
Figure FDA0003473903560000016
Figure FDA0003473903560000017
s203: for the harmonic loss estimation correction, the correction process is as follows:
Figure FDA0003473903560000021
s204: for each frequency band k, estimating the power spectral density by adopting a uniform smoothing factor alpha; wherein, the power spectral density comprises the power spectral density of background noise, the power spectral density of time-frequency masked voice and the power spectral density of harmonic loss;
the power spectral density estimation process is as follows:
ρv(k)=αρv(k)+(1-α)(1-M(l,k))|X(l,k)|2
ρs(k)=αρs(k)+(1-α)M(l,k)|X(l,k)|2
Figure FDA0003473903560000022
4. the harmonic structure prediction based speech enhancement post-processing method according to claim 3, wherein in step S3, for each frequency band k, the time-frequency masking value G (l, k) is estimated as follows:
Figure FDA0003473903560000023
5. the method for speech enhancement post-processing based on harmonic structure prediction according to claim 4, characterized in that in step S4, the frequency domain estimation of the target speech is
Figure FDA0003473903560000024
The acquisition process is as follows:
y^(″l,k″)=G(l,k)X(l,k),
and then, obtaining a target voice time domain estimation through inverse Fourier transform, wherein the process is as follows:
Figure FDA0003473903560000025
6. a speech enhancement post-processing device based on harmonic structure prediction is characterized by comprising a signal decomposition module, a loss estimation module, a masking calculation module and a speech synthesis module:
the signal decomposition module is used for carrying out short-time Fourier transform on the voice signal of the microphone to obtain time-frequency domain expression;
the loss estimation module is used for carrying out harmonic loss estimation and correction on the time-frequency domain signal to obtain estimated power spectral density; the harmonic loss compensation method comprises a masking calculation module, a harmonic loss calculation module, a loss estimation correction module and a power spectral density calculation module;
the masking calculation module is used for estimating a time-frequency masking value according to the power spectral density;
and the voice synthesis module is used for acquiring the frequency domain estimation of the target voice according to the estimated time-frequency masking value so as to obtain the time domain estimation of the target voice.
7. The harmonic structure prediction-based speech enhancement post-processing apparatus according to claim 6, wherein the signal decomposition module is further configured to obtain a speech signal x (n) of a microphone;
in the signal decomposition module, the process of performing short-time fourier transform on the speech signal x (n) of the microphone is as follows:
Figure FDA0003473903560000031
8. the harmonic structure prediction-based speech enhancement post-processing device according to claim 7, wherein the masking calculation module is configured to calculate the masked time-domain signal using the time-frequency masking value M (l, k) estimated by deep learning
Figure FDA0003473903560000032
The specific calculation process is as follows:
Figure FDA0003473903560000033
the harmonic loss calculation module is used for calculating the masked harmonic lossOf the time domain signal
Figure FDA0003473903560000034
After half-wave rectification, Fourier transform is carried out to obtain estimation of harmonic loss
Figure FDA0003473903560000035
The specific calculation formula is as follows:
Figure FDA0003473903560000036
Figure FDA0003473903560000037
the loss estimation and correction module is used for carrying out harmonic loss estimation and correction, and the correction process is as follows:
Figure FDA0003473903560000041
the power spectral density calculating module is used for estimating the power spectral density of each frequency band k by adopting a unified smoothing factor alpha; wherein the power spectral density comprises the power spectral density of background noise, the power spectral density of time-frequency masked voice and the power spectral density of harmonic loss,
the power spectral density estimation process is as follows:
ρv(k)=αρv(k)+(1-α)(1-M(l,k))|X(l,k)|2
ρs(k)=αρs(k)+(1-α)M(l,k)|X(l,k)|2
Figure FDA0003473903560000042
9. the harmonic structure prediction based speech enhancement post-processing device according to claim 8, wherein the masking computation module estimates a time-frequency masking value G (l, k) for each frequency band k by the following estimation process:
Figure FDA0003473903560000043
10. the harmonic structure prediction based speech enhancement post-processing apparatus as claimed in claim 9, wherein in the speech synthesis module, the frequency domain estimation of the target speech
Figure FDA0003473903560000044
The acquisition process is as follows:
y^(″l,k″)=G(l,k)X(l,k),
and obtaining a target voice time domain estimation through inverse Fourier transform, wherein the process is as follows:
Figure FDA0003473903560000045
CN202210049231.6A 2022-01-17 2022-01-17 Speech enhancement post-processing method and device based on harmonic structure prediction Pending CN114360560A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210049231.6A CN114360560A (en) 2022-01-17 2022-01-17 Speech enhancement post-processing method and device based on harmonic structure prediction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210049231.6A CN114360560A (en) 2022-01-17 2022-01-17 Speech enhancement post-processing method and device based on harmonic structure prediction

Publications (1)

Publication Number Publication Date
CN114360560A true CN114360560A (en) 2022-04-15

Family

ID=81092145

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210049231.6A Pending CN114360560A (en) 2022-01-17 2022-01-17 Speech enhancement post-processing method and device based on harmonic structure prediction

Country Status (1)

Country Link
CN (1) CN114360560A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115862656A (en) * 2023-02-03 2023-03-28 中国科学院自动化研究所 Method, device, equipment and storage medium for enhancing bone-conduction microphone voice

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115862656A (en) * 2023-02-03 2023-03-28 中国科学院自动化研究所 Method, device, equipment and storage medium for enhancing bone-conduction microphone voice

Similar Documents

Publication Publication Date Title
CN108831499B (en) Speech enhancement method using speech existence probability
US11056130B2 (en) Speech enhancement method and apparatus, device and storage medium
CN108735213B (en) Voice enhancement method and system based on phase compensation
CN109584903B (en) Multi-user voice separation method based on deep learning
CN111081268A (en) Phase-correlated shared deep convolutional neural network speech enhancement method
CN105741849A (en) Voice enhancement method for fusing phase estimation and human ear hearing characteristics in digital hearing aid
Xiang et al. A nested u-net with self-attention and dense connectivity for monaural speech enhancement
CN102347028A (en) Double-microphone speech enhancer and speech enhancement method thereof
US20070255560A1 (en) Low complexity noise reduction method
CN112017682B (en) Single-channel voice simultaneous noise reduction and reverberation removal system
CN106340292A (en) Voice enhancement method based on continuous noise estimation
US9390718B2 (en) Audio signal restoration device and audio signal restoration method
EP2346032A1 (en) Noise suppression device and audio decoding device
CN104835503A (en) Improved GSC self-adaptive speech enhancement method
CN110875049B (en) Voice signal processing method and device
CN103632677A (en) Method and device for processing voice signal with noise, and server
CN107180643A (en) One kind is uttered long and high-pitched sounds sound detection and elimination system
WO2021007841A1 (en) Noise estimation method, noise estimation apparatus, speech processing chip and electronic device
CN111105809B (en) Noise reduction method and device
CN114360560A (en) Speech enhancement post-processing method and device based on harmonic structure prediction
CN107045874B (en) Non-linear voice enhancement method based on correlation
CN112185405A (en) Bone conduction speech enhancement method based on differential operation and joint dictionary learning
Rao et al. Speech enhancement using sub-band cross-correlation compensated Wiener filter combined with harmonic regeneration
CN114067825A (en) Comfort noise generation method based on time-frequency masking estimation and application thereof
Chen et al. Gesper: A unified framework for general speech restoration

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination