CN114360560A - Speech enhancement post-processing method and device based on harmonic structure prediction - Google Patents
Speech enhancement post-processing method and device based on harmonic structure prediction
- Publication number: CN114360560A
- Application number: CN202210049231.6A
- Authority: CN (China)
- Prior art keywords: time, estimation, spectral density, power spectral, frequency
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0224—Noise filtering characterised by the method used for estimating noise; processing in the time domain
- G10L21/0232—Noise filtering characterised by the method used for estimating noise; processing in the frequency domain
- G10L21/0316—Speech enhancement by changing the amplitude
Abstract
The invention discloses a speech enhancement post-processing method and device based on harmonic structure prediction, belonging to the field of information processing. The method comprises the following steps: S1: performing a short-time Fourier transform on the microphone speech signal to obtain a time-frequency domain representation; S2: performing harmonic loss estimation and correction on the time-frequency domain signal to obtain estimated power spectral densities; S3: estimating a time-frequency masking value from the power spectral densities; S4: obtaining a frequency-domain estimate of the target speech from the estimated time-frequency masking value, and from it a time-domain estimate of the target speech. The invention can predict the lost harmonic structure to a certain extent; the recovered speech better matches the characteristics of close-talking speech and has higher intelligibility and perceptual speech quality.
Description
Technical Field
The invention belongs to the field of information processing, and particularly relates to a speech enhancement post-processing method and device based on harmonic structure prediction.
Background
Background noise degrades the communication quality of speech systems in many applications, such as voice conferencing. Suppressing the noise in the signal collected by the microphone is one of the key technologies required by conference-system applications. However, noise suppression methods damage the speech signal while suppressing the noise. It is therefore also necessary to consider how to enhance the speech signal, especially its harmonic structure, while suppressing noise.
In the prior art, noise suppression and speech enhancement are key technologies for speech communication quality in conference systems and conference equipment. The conventional signal-processing approach tracks the noise power spectral density and the speech power spectral density in the signal and then, based on Wiener filtering, constructs a masking value between 0 and 1 in the frequency domain. Applying this mask to the original signal suppresses the background noise. To overcome the ineffectiveness of conventional signal processing against non-stationary noise, time-frequency masking estimation based on deep learning has become increasingly mature and widely applied. The main idea is to train on pairs of noisy and clean speech so that the time-frequency masking value can be estimated directly from the mixed signal. At present, deep-learning noise suppression outperforms conventional signal processing. However, both conventional signal processing and deep-learning methods carry a risk of speech distortion. Because the energy of a speech signal is concentrated on its harmonic structure, enhancing speech by predicting the harmonic structure is of great significance for improving speech communication quality.
At present, the main disadvantages of prior-art methods for estimating the time-frequency masking value are as follows: 1. existing time-frequency masking methods tend to ignore the harmonic structure of speech, so that some harmonics are lost and communication quality suffers; 2. in far-field speaking scenarios, reverberation attenuates the high-frequency harmonics of speech, and existing time-frequency masking methods cannot recover the attenuated high-frequency harmonics.
The information disclosed in this background section is only for enhancement of understanding of the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.
Disclosure of Invention
The invention aims to provide a speech enhancement post-processing method and device based on harmonic structure prediction that can predict the lost harmonic structure to a certain extent; the recovered speech better matches the characteristics of close-talking speech and has higher intelligibility and perceptual speech quality.
In order to achieve the above object, the present invention provides a speech enhancement post-processing method based on harmonic structure prediction, which includes the following steps:
s1: carrying out short-time Fourier transform on a voice signal of a microphone to obtain a time-frequency domain expression;
s2: carrying out harmonic loss estimation and correction on the time-frequency domain signal to obtain estimated power spectral density;
s3: estimating a time-frequency masking value according to the power spectral density;
s4: and according to the estimated time-frequency masking value, acquiring frequency domain estimation of the target voice, and further acquiring time domain estimation of the target voice.
In an embodiment of the present invention, the step S1 is preceded by: acquiring a voice signal x (n) of a microphone;
the short-time fourier transform of the speech signal x (n) of the microphone in step S1 is as follows:
in an embodiment of the present invention, the step S2 specifically includes the following steps:
S201: calculating the masked time-domain signal using the time-frequency masking value M(l,k) estimated by deep learning; the specific calculation process is as follows:
S202: performing half-wave rectification on the masked time-domain signal and then a Fourier transform to obtain an estimate of the harmonic loss; the specific calculation formula is as follows:
S203: correcting the harmonic loss estimate; the correction process is as follows:
S204: for each frequency band k, estimating the power spectral densities using a uniform smoothing factor α; the power spectral densities comprise the power spectral density of the background noise, of the time-frequency-masked speech, and of the harmonic loss;
the power spectral density estimation process is as follows:
ρ_v(k) = α·ρ_v(k) + (1−α)·(1−M(l,k))·|X(l,k)|²
ρ_s(k) = α·ρ_s(k) + (1−α)·M(l,k)·|X(l,k)|²
in an embodiment of the present invention, in the step S3, for each frequency band k, a time-frequency masking value G (l, k) is estimated, and the estimation process is as follows:
in one embodiment of the present invention, in the step S4, the frequency domain estimation of the target speechThe acquisition process is as follows:
y^(″l,k″)=G(l,k)X(l,k),
and then, obtaining a target voice time domain estimation through inverse Fourier transform, wherein the process is as follows:
the invention also provides a speech enhancement post-processing device based on harmonic structure prediction, which comprises a signal decomposition module, a loss estimation module, a masking calculation module and a speech synthesis module, wherein the signal decomposition module comprises a harmonic structure prediction module, a harmonic structure prediction module and a harmonic structure prediction module, and the harmonic structure prediction module comprises a harmonic structure prediction module, a harmonic structure prediction module and a harmonic structure prediction module, wherein the harmonic structure prediction module comprises a harmonic structure prediction module, a harmonic structure prediction module and a harmonic structure prediction module, and the harmonic structure prediction module comprises a harmonic structure prediction module, a harmonic structure prediction module and a harmonic structure prediction module, wherein the harmonic structure prediction module comprises a harmonic structure:
the signal decomposition module is used for carrying out short-time Fourier transform on the voice signal of the microphone to obtain time-frequency domain expression;
the loss estimation module is used for performing harmonic loss estimation and correction on the time-frequency domain signal to obtain estimated power spectral densities; it comprises a masking calculation module, a harmonic loss calculation module, a loss estimation correction module and a power spectral density calculation module;
the masking calculation module is used for estimating a time-frequency masking value according to the power spectral density;
and the voice synthesis module is used for acquiring the frequency domain estimation of the target voice according to the estimated time-frequency masking value so as to obtain the time domain estimation of the target voice.
In an embodiment of the present invention, the signal decomposition module is further configured to obtain a speech signal x (n) of a microphone;
in the signal decomposition module, the process of performing short-time fourier transform on the speech signal x (n) of the microphone is as follows:
in one embodiment of the present inventionThe masking calculation module is configured to calculate a masked time-domain signal by using a time-frequency masking value M (l, k) estimated by deep learningThe specific calculation process is as follows:
the harmonic loss calculation module is used for calculating the masked time domain signalAfter half-wave rectification, Fourier transform is carried out to obtain estimation of harmonic lossThe specific calculation formula is as follows:
the loss estimation and correction module is used for carrying out harmonic loss estimation and correction, and the correction process is as follows:
the power spectral density calculation module is used for estimating, for each frequency band k, the power spectral densities using a uniform smoothing factor α; the power spectral densities comprise the power spectral density of the background noise, of the time-frequency-masked speech, and of the harmonic loss,
the power spectral density estimation process is as follows:
ρ_v(k) = α·ρ_v(k) + (1−α)·(1−M(l,k))·|X(l,k)|²
ρ_s(k) = α·ρ_s(k) + (1−α)·M(l,k)·|X(l,k)|²
in an embodiment of the present invention, the masking calculation module estimates a time-frequency masking value G (l, k) for each frequency band k, and the estimation process is as follows:
in an embodiment of the present invention, in the speech synthesis module, the frequency domain estimation of the target speechThe acquisition process is as follows:
y^(″l,k″)=G(l,k)X(l,k),
and obtaining a target voice time domain estimation through inverse Fourier transform, wherein the process is as follows:
the invention provides a speech enhancement post-processing method and a speech enhancement post-processing device based on harmonic structure prediction, which have the following beneficial effects:
1. The method and device estimate the harmonic components lost in the time-frequency masking obtained by deep learning and construct new time-frequency masking information from this estimate, so that the harmonic components can be effectively recovered and the perceptual speech quality improved.
2. The invention adopts a novel time-frequency masking estimate that takes the specific requirements of voice communication into account: it suppresses noise while recovering the lost harmonics to a certain degree, yielding better communication quality.
Drawings
Fig. 1 is a flowchart of a speech enhancement post-processing method based on harmonic structure prediction in the present embodiment.
Fig. 2 is a diagram of a hamming window function used in this embodiment.
Fig. 3 is a schematic diagram of a speech enhancement post-processing device based on harmonic structure prediction according to the present embodiment.
Detailed Description
The present invention will be described in further detail below with reference to specific embodiments, so that those skilled in the art can better understand the solutions of the present invention.
As shown in fig. 1, an embodiment of the present invention is a speech enhancement post-processing method based on harmonic structure prediction.
The method specifically comprises the following four implementation steps:
s1: and carrying out short-time Fourier transform on the voice signal of the microphone to obtain a time-frequency domain expression.
The voice signal of the microphone is a digital signal obtained after sound pressure collected by the microphone passes through the ADC.
Before step S1, the method further includes acquiring the microphone speech signal, as follows: let x(n) denote the original time-domain signal picked up by the microphone element in real time, where n is the time index.
Specifically, the process of performing short-time fourier transform on the speech signal x (n) of the microphone to obtain the time-frequency domain expression is as follows:
where N is the frame length, N = 512; w(n) is a Hamming window of length 512, where n is the time index, so w(n) is the window value at each index n; l is the time-frame index, in units of frames; k is the frequency-band index, where a frequency band refers to the signal component at a certain frequency; j is the imaginary unit; and X(l,k) is the spectrum of the k-th frequency band in the l-th frame of the microphone speech signal.
The hamming window function used in the present invention is shown in fig. 2.
Through the above step S1, the time-domain microphone speech signal is converted into a time-frequency-domain signal for processing in the subsequent steps.
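Because the transform's formula image is not reproduced above, the following sketch illustrates a standard short-time Fourier transform with the stated N = 512 Hamming window; the hop size (here 50% overlap, i.e. 256 samples) is an assumption, as the text does not state it:

```python
import numpy as np

def stft(x, N=512, hop=256):
    """Short-time Fourier transform with a length-N Hamming window.

    The patent fixes N = 512; the hop size is an assumption (50% overlap).
    Returns X[l, k]: the spectrum of frequency band k in time frame l.
    """
    w = np.hamming(N)                        # analysis window w(n)
    n_frames = 1 + (len(x) - N) // hop
    X = np.empty((n_frames, N // 2 + 1), dtype=complex)
    for l in range(n_frames):
        frame = x[l * hop:l * hop + N] * w   # windowed frame l
        X[l] = np.fft.rfft(frame)            # one-sided spectrum
    return X

# usage: a 1 kHz tone at 16 kHz sampling concentrates energy in band k = 32
fs = 16000
t = np.arange(fs) / fs
X = stft(np.sin(2 * np.pi * 1000 * t))
```

With N = 512 and fs = 16 kHz, band k corresponds to k·31.25 Hz, so the tone lands exactly in band 32.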
S2: and performing harmonic loss estimation and correction on the time-frequency domain signal to obtain estimated power spectral density.
Specifically, the time-frequency masking value of deep learning estimation is adopted to calculate the masked time-domain signal, and the harmonic loss is estimated and corrected, so that the power spectral density is estimated.
Specifically, the present step S2 includes the steps of:
S201: calculate the masked time-domain signal using the time-frequency masking value M(l,k) estimated by deep learning; the specific calculation process is as follows:
Time-frequency masking M(l,k) is a common scheme for deep-learning-based speech enhancement.
The time-domain signal output in step S201 has a certain harmonic loss: the energy in some bands of the periodic harmonic structure (the fundamental frequency and its multiples) is significantly attenuated, but most of the harmonic structure is preserved and can be used to compute the lost harmonic structure in subsequent steps.
Unlike step S1, the inverse Fourier transform in step S201 does not require a window function.
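Step S201 can be sketched as follows; applying the mask as M(l,k)·X(l,k) and taking a per-frame inverse FFT without a window is the standard reading of the text, and the shapes and names here are illustrative:

```python
import numpy as np

def apply_mask(X, M):
    """Compute the masked time-domain signal of step S201.

    X[l, k] is the frame spectrum; M[l, k] is the deep-learning mask,
    assumed to lie in [0, 1]. Per the text, no window is applied in
    this inverse transform (unlike the analysis stage).
    """
    Y = M * X                       # masked spectrum Y(l, k)
    return np.fft.irfft(Y, axis=1)  # masked time-domain frames

# usage: an all-ones mask leaves every frame unchanged
rng = np.random.default_rng(0)
frames = rng.standard_normal((4, 512))
X = np.fft.rfft(frames, axis=1)
y = apply_mask(X, np.ones((4, 257)))
```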
S202: for the masked time domain signalAfter half-wave rectification, Fourier transform is carried out to obtain estimation of harmonic lossThe specific calculation formula is as follows:
where sign () represents a half-wave rectification operation.
In step S202, smoothing of adjacent harmonic frequency bands can be achieved by half-wave rectification and then fourier transform, and damaged harmonics can be partially recovered. In the estimation of the harmonic loss obtainedIf there is a certain under-estimation, correction will be performed in the subsequent step S203.
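Step S202 can be sketched as below. The formula image is missing, so max(y, 0) is assumed as the half-wave rectifier (one common form of the text's sign(·)-based expression); the example shows how rectifying a pure tone regenerates energy at a harmonic band the original tone does not occupy:

```python
import numpy as np

def harmonic_loss_estimate(y_frames):
    """Estimate the lost harmonic structure (step S202).

    Half-wave rectification of the masked time-domain frames generates
    energy at integer multiples of the fundamental; the subsequent
    Fourier transform then yields a spectrum in which damaged harmonic
    bands are partially refilled.
    """
    rectified = np.maximum(y_frames, 0.0)  # assumed half-wave rectifier
    return np.fft.rfft(rectified, axis=1)  # spectrum of rectified signal

# usage: a pure 1 kHz tone (band 32 at N=512, fs=16 kHz) has no energy at
# band 64, but its half-wave-rectified version does (second harmonic)
fs, N = 16000, 512
n = np.arange(N)
tone = np.sin(2 * np.pi * 1000 * n / fs)
H = harmonic_loss_estimate(tone[None, :])
```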
S203: carrying out harmonic loss estimation correction, wherein the correction process is as follows:
the harmonic loss estimated in the previous step S202 has a certain error, and the result of the estimation can be corrected in this step S203.
The principle of the correction method adopted in step S203 is that a time-frequency masking method is adopted, and the masking value M (l, k) is always smaller than 1. Therefore, if estimation of harmonic lossIs less than its original value Y (l, k), indicating that the harmonics are not recovered, and the original estimate is still used as a maskThe mask value. The result of this step S203 can be used to update the power spectral density and to update the time-frequency mask estimate in subsequent steps.
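Since the correction formula image is absent, the following sketch encodes the stated rule as a band-wise choice: wherever the harmonic-loss estimate's magnitude falls below the masked spectrum's, the original estimate is kept. The function name and the element-wise magnitude comparison are assumptions:

```python
import numpy as np

def correct_harmonic_estimate(Y_masked, Y_harm):
    """Correct the harmonic-loss estimate (step S203, assumed form).

    Because M(l, k) < 1, a band where the rectified estimate's magnitude
    is below the masked spectrum Y(l, k) recovered no harmonic, so the
    original masked value is kept there.
    """
    keep_original = np.abs(Y_harm) < np.abs(Y_masked)
    return np.where(keep_original, Y_masked, Y_harm)

# usage: band 0's estimate (2) exceeds the masked value (1) and is kept;
# band 1's estimate (1) falls below the masked value (3), so 3 wins
Ym = np.array([[1.0 + 0j, 3.0 + 0j]])
Yh = np.array([[2.0 + 0j, 1.0 + 0j]])
Yc = correct_harmonic_estimate(Ym, Yh)
```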
S204: and estimating the power spectral density by adopting a uniform smoothing factor alpha for each frequency band k, wherein the power spectral density comprises the power spectral density of background noise, the power spectral density of time-frequency masked voice and the power spectral density of harmonic loss.
The power spectral density estimation process is as follows:
ρ_v(k) = α·ρ_v(k) + (1−α)·(1−M(l,k))·|X(l,k)|²
ρ_s(k) = α·ρ_s(k) + (1−α)·M(l,k)·|X(l,k)|²
where ρ_v(k), ρ_s(k) and ρ_h(k) respectively denote the power spectral densities of the background noise, the time-frequency-masked speech, and the harmonic loss. ρ_h(k) can be understood as the average harmonic loss over a period of time, so using ρ_h(k) for harmonic recovery gives better robustness. α is a smoothing factor between adjacent frames with a value between 0 and 1: if it is too small, the power spectral density estimate fluctuates too much and is unstable; if it is too high, the energy estimate is too stationary and the ability to model non-stationary signals is reduced. The invention preferably sets α = 0.95, which balances stability against the ability to model non-stationary noise.
The output of step S204 is used in subsequent steps to construct masking values that can restore the harmonic structure.
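The recursive PSD updates of S204 can be sketched as follows. The ρ_v and ρ_s recursions follow the formulas above; the ρ_h recursion is not reproduced in the text, so feeding it the squared magnitude of the corrected harmonic-loss estimate is an assumption:

```python
import numpy as np

ALPHA = 0.95  # the patent's preferred smoothing factor

def update_psd(rho_v, rho_s, rho_h, X, M, Yc, alpha=ALPHA):
    """One-frame recursive power-spectral-density update (step S204).

    rho_v: background-noise PSD; rho_s: PSD of the time-frequency-masked
    speech; rho_h: PSD of the harmonic loss. X and M are the current
    frame's spectrum and mask; Yc is the corrected harmonic-loss
    estimate (its use in the rho_h recursion is assumed).
    """
    rho_v = alpha * rho_v + (1 - alpha) * (1 - M) * np.abs(X) ** 2
    rho_s = alpha * rho_s + (1 - alpha) * M * np.abs(X) ** 2
    rho_h = alpha * rho_h + (1 - alpha) * np.abs(Yc) ** 2
    return rho_v, rho_s, rho_h

# usage: with a constant input the recursions converge to the
# instantaneous per-frame values
rho_v = np.zeros(1)
rho_s = np.zeros(1)
rho_h = np.zeros(1)
Xf = np.ones(1)
Mf = np.full(1, 0.5)
Ycf = np.zeros(1)
for _ in range(200):
    rho_v, rho_s, rho_h = update_psd(rho_v, rho_s, rho_h, Xf, Mf, Ycf)
```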
S3: and estimating a time-frequency masking value according to the power spectral density.
Specifically, for each frequency band k, a time-frequency masking value G (l, k) is estimated, and the estimation process is as follows:
where max(·,·) takes the larger of its two arguments. In step S3, the harmonic structure is enhanced, and the computed time-frequency masking value is used in the subsequent step to obtain the speech spectrum estimate.
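The exact formula for G(l,k) is not reproduced in the text; the sketch below encodes one plausible reading consistent with the description, a Wiener-style gain built from the three PSDs and floored by the deep-learning mask via max(·,·). This construction is an assumption, not the patent's verbatim formula:

```python
import numpy as np

def harmonic_gain(M, rho_v, rho_s, rho_h, eps=1e-12):
    """Time-frequency masking value G(l, k) for step S3 (assumed form).

    A Wiener-style gain from the speech and harmonic-loss PSDs is
    floored by the deep-learning mask M(l, k), so bands where the
    recovered harmonic energy is strong get their gain raised.
    """
    wiener = (rho_s + rho_h) / (rho_s + rho_h + rho_v + eps)
    return np.maximum(M, wiener)

# usage: the mask alone gives 0.1, but the recovered harmonic PSD
# lifts the gain in that band
G = harmonic_gain(np.array([0.1]),
                  rho_v=np.array([1.0]),
                  rho_s=np.array([0.1]),
                  rho_h=np.array([0.9]))
```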
S4: and according to the estimated time-frequency masking value, acquiring frequency domain estimation of the target voice, and further acquiring time domain estimation of the target voice.
Ŷ(l,k) = G(l,k)·X(l,k)
and then, obtaining a target voice time domain estimation through inverse Fourier transform, wherein the process is as follows:
through the step S4, the time domain estimated signal can be directly converted into a voltage signal through digital-to-analog conversion, and the enhanced voice signal is played by a speaker.
Through the steps S1-S4, signal time-frequency decomposition, harmonic loss estimation, time-frequency masking calculation and target voice synthesis can be realized, and finally, the voice communication quality is improved.
As shown in fig. 3, an embodiment of the present invention is a speech enhancement post-processing apparatus based on harmonic structure prediction, and includes a signal decomposition module 1, a loss estimation module 2, a masking calculation module 3, and a speech synthesis module 4.
And the signal decomposition module 1 is used for carrying out short-time Fourier transform on the voice signal of the microphone to obtain a time-frequency domain expression.
The signal decomposition module 1 can also be used to acquire the microphone speech signal and an echo reference signal, where the acquired speech signal is as follows: let x(n) denote the original time-domain signal picked up by the microphone element in real time, where n is the time index.
Specifically, the method of performing the short-time fourier transform is as follows:
carrying out short-time Fourier transform on the time domain signal x (n) to obtain a time-frequency domain expression:
where N is the frame length, N = 512; w(n) is a Hamming window of length 512, where n is the time index, so w(n) is the window value at each index n; l is the time-frame index, in units of frames; k is the frequency-band index, where a frequency band refers to the signal component at a certain frequency; j is the imaginary unit; and X(l,k) is the spectrum of the k-th frequency band in the l-th frame of the microphone speech signal.
The hamming window function used in the present invention is shown in fig. 2.
The time domain signal of the voice signal of the microphone can be converted into a time-frequency domain signal by the signal decomposition module 1.
The loss estimation module 2 is configured to perform harmonic loss estimation on the time-frequency domain signal, estimate the lost harmonic components, and obtain the estimated power spectral densities. The loss estimation module 2 comprises a masking calculation module, a harmonic loss calculation module, a loss estimation correction module and a power spectral density calculation module.
Specifically, the masking calculation module is configured to calculate the masked time-domain signal using the time-frequency masking value M(l,k) estimated by deep learning; the specific calculation process is as follows:
The time-domain signal output by the masking calculation module has a certain harmonic loss: the energy in some bands of the periodic harmonic structure (the fundamental frequency and its multiples) is significantly attenuated, but most of the harmonic structure is preserved and can be used subsequently to compute the lost harmonic structure.
Unlike the signal decomposition module 1, the inverse Fourier transform in the masking calculation module does not require a window function.
The harmonic loss calculation module is used for performing half-wave rectification on the masked time-domain signal and then a Fourier transform to obtain an estimate of the harmonic loss; the specific calculation process is as follows:
where sign(·) denotes the half-wave rectification operation.
The harmonic loss calculation module smooths adjacent harmonic bands by half-wave rectification followed by a Fourier transform and can partially recover damaged harmonics. The obtained harmonic loss estimate is somewhat underestimated and is corrected subsequently.
The loss estimation and correction module is used for carrying out harmonic loss estimation and correction, and the correction process is as follows:
the harmonic loss estimated in the harmonic loss calculation module has certain error, and the loss estimation correction module can correct the estimated result.
The principle of the correction method adopted by the loss estimation correction module is that a time-frequency masking method is adopted, and the masking value M (l, k) is always smaller than 1. Therefore, if estimation of harmonic lossThe value of (c) is less than its original value X (l, k), indicating that the harmonics are not recovered, and the original estimate is still used as the masking value. The results of the loss estimation correction module can be used to subsequently update the power spectral density and update the time-frequency mask estimate.
The power spectral density calculation module is used for estimating, for each frequency band k, the power spectral densities using a uniform smoothing factor α; the power spectral densities comprise the power spectral density of the background noise, of the time-frequency-masked speech, and of the harmonic loss.
The power spectral density estimation process is as follows:
ρ_v(k) = α·ρ_v(k) + (1−α)·(1−M(l,k))·|X(l,k)|²
ρ_s(k) = α·ρ_s(k) + (1−α)·M(l,k)·|X(l,k)|²
where ρ_v(k), ρ_s(k) and ρ_h(k) respectively denote the power spectral densities of the background noise, the time-frequency-masked speech, and the harmonic loss. ρ_h(k) can be understood as the average harmonic loss over a period of time, so using ρ_h(k) for harmonic recovery gives better robustness. α is a smoothing factor between adjacent frames with a value between 0 and 1. The invention preferably sets α = 0.95: if the value is too small, the power spectral density estimate fluctuates too much and is unstable; if it is too high, the energy estimate is too stationary and the ability to model non-stationary signals is reduced.
The masking calculation module 3 is used for estimating the time-frequency masking value from the power spectral densities.
Specifically, for each frequency band k, the time-frequency masking value G (l, k) is estimated, and the estimation process is as follows:
where max(·,·) takes the larger of its two arguments. In the masking calculation module 3, the harmonic structure is enhanced, and the computed time-frequency masking value is used subsequently to obtain the speech spectrum estimate.
And the voice synthesis module 4 is configured to obtain a frequency domain estimation of the target voice according to the estimated time-frequency masking value, and further obtain a time domain estimation of the target voice.
Ŷ(l,k) = G(l,k)·X(l,k)
and then, obtaining a target voice time domain estimation through inverse Fourier transform, wherein the process is as follows:
through the speech synthesis module 4, the time domain estimated signal can be directly converted into a voltage signal through digital-to-analog conversion, and the enhanced speech signal is played by a loudspeaker.
The speech enhancement post-processing device based on harmonic structure prediction can realize signal time-frequency decomposition, harmonic loss estimation, time-frequency masking calculation and target speech synthesis, and finally improve the speech communication quality.
The foregoing descriptions of specific exemplary embodiments of the present invention have been presented for purposes of illustration and description. It is not intended to limit the invention to the precise form disclosed, and obviously many modifications and variations are possible in light of the above teaching. The exemplary embodiments were chosen and described in order to explain certain principles of the invention and its practical application to enable one skilled in the art to make and use various exemplary embodiments of the invention and various alternatives and modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims and their equivalents.
Claims (10)
1. A speech enhancement post-processing method based on harmonic structure prediction, characterized by comprising the following steps:
s1: carrying out a short-time Fourier transform on a speech signal from a microphone to obtain a time-frequency domain representation;
s2: carrying out harmonic loss estimation and correction on the time-frequency domain signal to obtain estimated power spectral densities;
s3: estimating a time-frequency masking value according to the power spectral densities;
s4: according to the estimated time-frequency masking value, acquiring a frequency-domain estimate of the target speech, and further acquiring a time-domain estimate of the target speech.
2. The method for speech enhancement post-processing based on harmonic structure prediction according to claim 1, characterized in that step S1 is preceded by the step of: acquiring a speech signal x(n) from a microphone;
the short-time Fourier transform of the speech signal x(n) of the microphone in step S1 is as follows:
3. The harmonic structure prediction-based speech enhancement post-processing method according to claim 2, characterized in that step S2 specifically comprises the following steps:
s201: calculating the masked time-domain signal using the time-frequency masking value M(l, k) estimated by deep learning; the specific calculation process is as follows:
s202: applying half-wave rectification to the masked time-domain signal and then carrying out a Fourier transform to obtain an estimate of the harmonic loss; the specific calculation formula is as follows:
s203: correcting the harmonic loss estimate; the correction process is as follows:
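Steps S201–S202 can be sketched on a single spectral frame as below. The single-frame treatment is a simplification of the frame-by-frame STFT processing, and the unity mask used in testing is an assumption; the rectify-then-retransform step follows the harmonic-regeneration idea the claim describes.

```python
import numpy as np

def harmonic_loss_estimate(X_frame, M_frame):
    """Sketch of steps S201-S202 on one spectral frame.

    Applies the learned mask, returns to the time domain, half-wave
    rectifies, and re-transforms to estimate the harmonic loss.
    """
    s = np.fft.irfft(M_frame * X_frame)  # S201: masked time-domain signal
    s_hwr = np.maximum(s, 0.0)           # half-wave rectification regenerates harmonics
    return np.fft.rfft(s_hwr)            # S202: harmonic loss spectrum estimate
```

Half-wave rectification is a memoryless nonlinearity that re-creates energy at integer multiples of the fundamental, which is why it serves as a predictor of the harmonic structure suppressed by the mask.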
4. The method for speech enhancement post-processing based on harmonic structure prediction according to claim 3, characterized by further comprising: s204: for each frequency band k, estimating the power spectral densities using a uniform smoothing factor α; wherein the power spectral densities comprise the power spectral density of the background noise, the power spectral density of the time-frequency masked speech, and the power spectral density of the harmonic loss;
the power spectral density estimation process is as follows:
ρv(k) = αρv(k) + (1 − α)(1 − M(l, k))|X(l, k)|²
ρs(k) = αρs(k) + (1 − α)M(l, k)|X(l, k)|²
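The recursive smoothing of step S204 can be sketched as below. The α value is illustrative, and the update for the harmonic-loss density is not shown in this text, so only the two recursions quoted above are mirrored.

```python
import numpy as np

ALPHA = 0.9  # uniform smoothing factor (illustrative value, not from the patent)

def update_psd(rho, frame_power, alpha=ALPHA):
    # rho(k) <- alpha * rho(k) + (1 - alpha) * frame_power(k)
    return alpha * rho + (1.0 - alpha) * frame_power

# per frame l, following the two recursions quoted in the claim:
#   rho_v = update_psd(rho_v, (1 - M) * np.abs(X) ** 2)  # background noise
#   rho_s = update_psd(rho_s, M * np.abs(X) ** 2)        # masked speech
```

First-order recursive averaging like this tracks slowly varying spectral statistics while suppressing frame-to-frame fluctuations, which is why one shared α suffices for all three densities.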
5. The method for speech enhancement post-processing based on harmonic structure prediction according to claim 4, characterized in that in step S4, the frequency-domain estimate of the target speech is acquired as follows:
Ŷ(l, k) = G(l, k)X(l, k),
and the target speech time-domain estimate is then obtained through the inverse Fourier transform, as follows:
6. A speech enhancement post-processing device based on harmonic structure prediction, characterized by comprising a signal decomposition module, a loss estimation module, a masking calculation module, and a speech synthesis module:
the signal decomposition module is used for carrying out a short-time Fourier transform on the speech signal of the microphone to obtain a time-frequency domain representation;
the loss estimation module is used for carrying out harmonic loss estimation and correction on the time-frequency domain signal to obtain estimated power spectral densities; the loss estimation module comprises a masking calculation module, a harmonic loss calculation module, a loss estimation correction module, and a power spectral density calculation module;
the masking calculation module is used for estimating a time-frequency masking value according to the power spectral densities;
and the speech synthesis module is used for acquiring the frequency-domain estimate of the target speech according to the estimated time-frequency masking value, so as to obtain the time-domain estimate of the target speech.
7. The harmonic structure prediction-based speech enhancement post-processing device according to claim 6, characterized in that the signal decomposition module is further configured to acquire a speech signal x(n) from a microphone;
in the signal decomposition module, the process of performing the short-time Fourier transform on the speech signal x(n) of the microphone is as follows:
8. The harmonic structure prediction-based speech enhancement post-processing device according to claim 7, characterized in that the masking calculation module is configured to calculate the masked time-domain signal using the time-frequency masking value M(l, k) estimated by deep learning; the specific calculation process is as follows:
the harmonic loss calculation module is used for applying half-wave rectification to the masked time-domain signal and then carrying out a Fourier transform to obtain an estimate of the harmonic loss; the specific calculation formula is as follows:
the loss estimation correction module is used for correcting the harmonic loss estimate; the correction process is as follows:
the power spectral density calculating module is used for estimating the power spectral density of each frequency band k by adopting a unified smoothing factor alpha; wherein the power spectral density comprises the power spectral density of background noise, the power spectral density of time-frequency masked voice and the power spectral density of harmonic loss,
the power spectral density estimation process is as follows:
ρv(k)=αρv(k)+(1-α)(1-M(l,k))|X(l,k)|2
ρs(k)=αρs(k)+(1-α)M(l,k)|X(l,k)|2
10. The harmonic structure prediction-based speech enhancement post-processing device according to claim 9, characterized in that in the speech synthesis module, the frequency-domain estimate of the target speech is acquired as follows:
Ŷ(l, k) = G(l, k)X(l, k),
and the target speech time-domain estimate is then obtained through the inverse Fourier transform, as follows:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210049231.6A CN114360560A (en) | 2022-01-17 | 2022-01-17 | Speech enhancement post-processing method and device based on harmonic structure prediction |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114360560A true CN114360560A (en) | 2022-04-15 |
Family
ID=81092145
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114360560A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115862656A (en) * | 2023-02-03 | 2023-03-28 | 中国科学院自动化研究所 | Method, device, equipment and storage medium for enhancing bone-conduction microphone voice |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||