CN114360560A - Speech enhancement post-processing method and device based on harmonic structure prediction - Google Patents
Speech enhancement post-processing method and device based on harmonic structure prediction
- Publication number: CN114360560A
- Application number: CN202210049231.6A
- Authority: CN (China)
- Prior art keywords: time, estimation, spectral density, power spectral, frequency
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0224—Noise filtering characterised by the method used for estimating noise; processing in the time domain
- G10L21/0232—Noise filtering characterised by the method used for estimating noise; processing in the frequency domain
- G10L21/0316—Speech enhancement by changing the amplitude
Abstract
The invention discloses a speech enhancement post-processing method and device based on harmonic structure prediction, belonging to the field of information processing. The method comprises the following steps: S1: performing a short-time Fourier transform on the microphone speech signal to obtain a time-frequency domain representation; S2: performing harmonic loss estimation and correction on the time-frequency domain signal to obtain estimated power spectral densities; S3: estimating a time-frequency masking value from the power spectral densities; S4: obtaining a frequency-domain estimate of the target speech from the estimated time-frequency masking value, and from it a time-domain estimate of the target speech. The invention can predict the lost harmonic structure to a certain extent; the recovered speech better matches the characteristics of close-talking speech and has higher intelligibility and perceptual speech quality.
Description
Technical Field
The invention belongs to the field of information processing, and particularly relates to a speech enhancement post-processing method and device based on harmonic structure prediction.
Background
Background noise degrades the communication quality of speech systems in many applications, such as voice conferencing. Suppressing the noise in the signal collected by the microphone is one of the key technologies required by conference-system applications. However, noise suppression methods damage the speech signal while suppressing the noise. It is therefore also necessary to consider how to enhance the speech signal, especially its harmonic structure, while suppressing noise.
In the prior art, noise suppression and speech enhancement are key technologies for speech communication quality in conference systems and conference equipment. The conventional signal-processing approach tracks the noise power spectral density and the speech power spectral density in the signal and then, based on Wiener filtering, constructs a masking value between 0 and 1 in the frequency domain. Applying this mask to the original signal suppresses the background noise. To overcome the ineffectiveness of conventional signal processing against non-stationary noise, time-frequency masking estimation based on deep learning has become increasingly mature and widely applied. The main idea is to train on pairs of noisy and clean speech so that the time-frequency masking value can be estimated directly from the mixed signal. At present, deep-learning noise suppression outperforms conventional signal processing. However, both conventional signal processing and deep-learning methods carry a risk of speech distortion. Because the energy of a speech signal is concentrated on its harmonic structure, enhancing speech by predicting the harmonic structure is of great significance for improving speech communication quality.
At present, the main disadvantages of prior-art methods for estimating the time-frequency masking value are as follows: 1. existing time-frequency masking methods tend to ignore the harmonic structure of speech, so that some harmonics are lost and communication quality suffers; 2. in far-field speaking scenarios, reverberation attenuates the high-frequency harmonics of speech, and existing time-frequency masking methods cannot recover the attenuated high-frequency harmonics.
The information disclosed in this background section is only for enhancement of understanding of the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.
Disclosure of Invention
The invention aims to provide a speech enhancement post-processing method and device based on harmonic structure prediction that can predict the lost harmonic structure to a certain extent; the recovered speech better matches the characteristics of close-talking speech and has higher intelligibility and perceptual speech quality.
In order to achieve the above object, the present invention provides a speech enhancement post-processing method based on harmonic structure prediction, which includes the following steps:
s1: carrying out short-time Fourier transform on a voice signal of a microphone to obtain a time-frequency domain expression;
s2: carrying out harmonic loss estimation and correction on the time-frequency domain signal to obtain estimated power spectral density;
s3: estimating a time-frequency masking value according to the power spectral density;
s4: and according to the estimated time-frequency masking value, acquiring frequency domain estimation of the target voice, and further acquiring time domain estimation of the target voice.
In an embodiment of the present invention, the step S1 is preceded by: acquiring a voice signal x (n) of a microphone;
the short-time fourier transform of the speech signal x (n) of the microphone in step S1 is as follows:
in an embodiment of the present invention, the step S2 specifically includes the following steps:
S201: calculating the masked time-domain signal using the time-frequency masking value M(l,k) estimated by deep learning; the specific calculation process is as follows:
S202: performing half-wave rectification on the masked time-domain signal and then a Fourier transform to obtain an estimate of the harmonic loss; the specific calculation formula is as follows:
S203: correcting the harmonic loss estimate; the correction process is as follows:
S204: for each frequency band k, estimating the power spectral densities using a uniform smoothing factor α; the power spectral densities comprise the power spectral density of the background noise, of the time-frequency-masked speech, and of the harmonic loss;
the power spectral density estimation process is as follows:
ρ_v(k) = α·ρ_v(k) + (1−α)·(1−M(l,k))·|X(l,k)|²
ρ_s(k) = α·ρ_s(k) + (1−α)·M(l,k)·|X(l,k)|²
in an embodiment of the present invention, in the step S3, for each frequency band k, a time-frequency masking value G (l, k) is estimated, and the estimation process is as follows:
in one embodiment of the present invention, in the step S4, the frequency domain estimation of the target speechThe acquisition process is as follows:
y^(″l,k″)=G(l,k)X(l,k),
and then, obtaining a target voice time domain estimation through inverse Fourier transform, wherein the process is as follows:
the invention also provides a speech enhancement post-processing device based on harmonic structure prediction, which comprises a signal decomposition module, a loss estimation module, a masking calculation module and a speech synthesis module, wherein the signal decomposition module comprises a harmonic structure prediction module, a harmonic structure prediction module and a harmonic structure prediction module, and the harmonic structure prediction module comprises a harmonic structure prediction module, a harmonic structure prediction module and a harmonic structure prediction module, wherein the harmonic structure prediction module comprises a harmonic structure prediction module, a harmonic structure prediction module and a harmonic structure prediction module, and the harmonic structure prediction module comprises a harmonic structure prediction module, a harmonic structure prediction module and a harmonic structure prediction module, wherein the harmonic structure prediction module comprises a harmonic structure:
the signal decomposition module is used for carrying out short-time Fourier transform on the voice signal of the microphone to obtain time-frequency domain expression;
the loss estimation module is used for performing harmonic loss estimation and correction on the time-frequency domain signal to obtain estimated power spectral densities; it comprises a masking calculation module, a harmonic loss calculation module, a loss estimation correction module and a power spectral density calculation module;
the masking calculation module is used for estimating a time-frequency masking value according to the power spectral density;
and the voice synthesis module is used for acquiring the frequency domain estimation of the target voice according to the estimated time-frequency masking value so as to obtain the time domain estimation of the target voice.
In an embodiment of the present invention, the signal decomposition module is further configured to obtain a speech signal x (n) of a microphone;
in the signal decomposition module, the process of performing short-time fourier transform on the speech signal x (n) of the microphone is as follows:
in one embodiment of the present inventionThe masking calculation module is configured to calculate a masked time-domain signal by using a time-frequency masking value M (l, k) estimated by deep learningThe specific calculation process is as follows:
the harmonic loss calculation module is used for calculating the masked time domain signalAfter half-wave rectification, Fourier transform is carried out to obtain estimation of harmonic lossThe specific calculation formula is as follows:
the loss estimation and correction module is used for carrying out harmonic loss estimation and correction, and the correction process is as follows:
the power spectral density calculation module is used for estimating, for each frequency band k, the power spectral densities using a uniform smoothing factor α; the power spectral densities comprise the power spectral density of the background noise, of the time-frequency-masked speech, and of the harmonic loss,
the power spectral density estimation process is as follows:
ρ_v(k) = α·ρ_v(k) + (1−α)·(1−M(l,k))·|X(l,k)|²
ρ_s(k) = α·ρ_s(k) + (1−α)·M(l,k)·|X(l,k)|²
in an embodiment of the present invention, the masking calculation module estimates a time-frequency masking value G (l, k) for each frequency band k, and the estimation process is as follows:
in an embodiment of the present invention, in the speech synthesis module, the frequency domain estimation of the target speechThe acquisition process is as follows:
y^(″l,k″)=G(l,k)X(l,k),
and obtaining a target voice time domain estimation through inverse Fourier transform, wherein the process is as follows:
the invention provides a speech enhancement post-processing method and a speech enhancement post-processing device based on harmonic structure prediction, which have the following beneficial effects:
1. The method and device estimate the harmonic components lost in the time-frequency masking obtained by deep learning and construct new time-frequency masking information from this estimate, so that the harmonic components can be effectively recovered and the perceptual speech quality improved.
2. The invention adopts a novel time-frequency masking estimate that takes the specific requirements of voice communication into account: it suppresses noise while recovering the lost harmonics to a certain degree, yielding better communication quality.
Drawings
Fig. 1 is a flowchart of a speech enhancement post-processing method based on harmonic structure prediction in the present embodiment.
Fig. 2 is a diagram of a hamming window function used in this embodiment.
Fig. 3 is a schematic diagram of a speech enhancement post-processing device based on harmonic structure prediction according to the present embodiment.
Detailed Description
The present invention will be described in further detail below with reference to specific embodiments, so that those skilled in the art can better understand the solutions of the present invention.
As shown in fig. 1, an embodiment of the present invention is a speech enhancement post-processing method based on harmonic structure prediction.
The method specifically comprises the following four implementation steps:
s1: and carrying out short-time Fourier transform on the voice signal of the microphone to obtain a time-frequency domain expression.
The voice signal of the microphone is a digital signal obtained after sound pressure collected by the microphone passes through the ADC.
Before step S1, the method further includes acquiring the microphone speech signal, as follows: let x(n) denote the original time-domain signal picked up by the microphone element in real time, where n is the time index.
Specifically, the process of performing short-time fourier transform on the speech signal x (n) of the microphone to obtain the time-frequency domain expression is as follows:
where N is the frame length, N = 512; w(n) is a Hamming window of length 512, where n is the time index, so w(n) is the window value at each index n; l is the time-frame index, in units of frames; k is the frequency-band index, where a frequency band refers to the signal component at a certain frequency; j is the imaginary unit; and X(l,k) is the spectrum of the k-th frequency band in the l-th frame of the microphone speech signal.
The hamming window function used in the present invention is shown in fig. 2.
Through the above step S1, the time-domain microphone speech signal is converted into a time-frequency-domain signal for processing in the subsequent steps.
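Because the transform's formula image is not reproduced above, the following sketch illustrates a standard short-time Fourier transform with the stated N = 512 Hamming window; the hop size (here 50% overlap, i.e. 256 samples) is an assumption, as the text does not state it:

```python
import numpy as np

def stft(x, N=512, hop=256):
    """Short-time Fourier transform with a length-N Hamming window.

    The patent fixes N = 512; the hop size is an assumption (50% overlap).
    Returns X[l, k]: the spectrum of frequency band k in time frame l.
    """
    w = np.hamming(N)                        # analysis window w(n)
    n_frames = 1 + (len(x) - N) // hop
    X = np.empty((n_frames, N // 2 + 1), dtype=complex)
    for l in range(n_frames):
        frame = x[l * hop:l * hop + N] * w   # windowed frame l
        X[l] = np.fft.rfft(frame)            # one-sided spectrum
    return X

# usage: a 1 kHz tone at 16 kHz sampling concentrates energy in band k = 32
fs = 16000
t = np.arange(fs) / fs
X = stft(np.sin(2 * np.pi * 1000 * t))
```

With N = 512 and fs = 16 kHz, band k corresponds to k·31.25 Hz, so the tone lands exactly in band 32.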
S2: and performing harmonic loss estimation and correction on the time-frequency domain signal to obtain estimated power spectral density.
Specifically, the time-frequency masking value of deep learning estimation is adopted to calculate the masked time-domain signal, and the harmonic loss is estimated and corrected, so that the power spectral density is estimated.
Specifically, the present step S2 includes the steps of:
S201: calculate the masked time-domain signal using the time-frequency masking value M(l,k) estimated by deep learning; the specific calculation process is as follows:
Time-frequency masking M(l,k) is a common scheme for deep-learning-based speech enhancement.
The time-domain signal output in step S201 has a certain harmonic loss: the energy in some bands of the periodic harmonic structure (the fundamental frequency and its multiples) is significantly attenuated, but most of the harmonic structure is preserved and can be used to compute the lost harmonic structure in subsequent steps.
Unlike step S1, the inverse Fourier transform in step S201 does not require a window function.
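Step S201 can be sketched as follows; applying the mask as M(l,k)·X(l,k) and taking a per-frame inverse FFT without a window is the standard reading of the text, and the shapes and names here are illustrative:

```python
import numpy as np

def apply_mask(X, M):
    """Compute the masked time-domain signal of step S201.

    X[l, k] is the frame spectrum; M[l, k] is the deep-learning mask,
    assumed to lie in [0, 1]. Per the text, no window is applied in
    this inverse transform (unlike the analysis stage).
    """
    Y = M * X                       # masked spectrum Y(l, k)
    return np.fft.irfft(Y, axis=1)  # masked time-domain frames

# usage: an all-ones mask leaves every frame unchanged
rng = np.random.default_rng(0)
frames = rng.standard_normal((4, 512))
X = np.fft.rfft(frames, axis=1)
y = apply_mask(X, np.ones((4, 257)))
```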
S202: for the masked time domain signalAfter half-wave rectification, Fourier transform is carried out to obtain estimation of harmonic lossThe specific calculation formula is as follows:
where sign () represents a half-wave rectification operation.
In step S202, smoothing of adjacent harmonic frequency bands can be achieved by half-wave rectification and then fourier transform, and damaged harmonics can be partially recovered. In the estimation of the harmonic loss obtainedIf there is a certain under-estimation, correction will be performed in the subsequent step S203.
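Step S202 can be sketched as below. The formula image is missing, so max(y, 0) is assumed as the half-wave rectifier (one common form of the text's sign(·)-based expression); the example shows how rectifying a pure tone regenerates energy at a harmonic band the original tone does not occupy:

```python
import numpy as np

def harmonic_loss_estimate(y_frames):
    """Estimate the lost harmonic structure (step S202).

    Half-wave rectification of the masked time-domain frames generates
    energy at integer multiples of the fundamental; the subsequent
    Fourier transform then yields a spectrum in which damaged harmonic
    bands are partially refilled.
    """
    rectified = np.maximum(y_frames, 0.0)  # assumed half-wave rectifier
    return np.fft.rfft(rectified, axis=1)  # spectrum of rectified signal

# usage: a pure 1 kHz tone (band 32 at N=512, fs=16 kHz) has no energy at
# band 64, but its half-wave-rectified version does (second harmonic)
fs, N = 16000, 512
n = np.arange(N)
tone = np.sin(2 * np.pi * 1000 * n / fs)
H = harmonic_loss_estimate(tone[None, :])
```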
S203: carrying out harmonic loss estimation correction, wherein the correction process is as follows:
the harmonic loss estimated in the previous step S202 has a certain error, and the result of the estimation can be corrected in this step S203.
The principle of the correction method adopted in step S203 is that a time-frequency masking method is adopted, and the masking value M (l, k) is always smaller than 1. Therefore, if estimation of harmonic lossIs less than its original value Y (l, k), indicating that the harmonics are not recovered, and the original estimate is still used as a maskThe mask value. The result of this step S203 can be used to update the power spectral density and to update the time-frequency mask estimate in subsequent steps.
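Since the correction formula image is absent, the following sketch encodes the stated rule as a band-wise choice: wherever the harmonic-loss estimate's magnitude falls below the masked spectrum's, the original estimate is kept. The function name and the element-wise magnitude comparison are assumptions:

```python
import numpy as np

def correct_harmonic_estimate(Y_masked, Y_harm):
    """Correct the harmonic-loss estimate (step S203, assumed form).

    Because M(l, k) < 1, a band where the rectified estimate's magnitude
    is below the masked spectrum Y(l, k) recovered no harmonic, so the
    original masked value is kept there.
    """
    keep_original = np.abs(Y_harm) < np.abs(Y_masked)
    return np.where(keep_original, Y_masked, Y_harm)

# usage: band 0's estimate (2) exceeds the masked value (1) and is kept;
# band 1's estimate (1) falls below the masked value (3), so 3 wins
Ym = np.array([[1.0 + 0j, 3.0 + 0j]])
Yh = np.array([[2.0 + 0j, 1.0 + 0j]])
Yc = correct_harmonic_estimate(Ym, Yh)
```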
S204: and estimating the power spectral density by adopting a uniform smoothing factor alpha for each frequency band k, wherein the power spectral density comprises the power spectral density of background noise, the power spectral density of time-frequency masked voice and the power spectral density of harmonic loss.
The power spectral density estimation process is as follows:
ρ_v(k) = α·ρ_v(k) + (1−α)·(1−M(l,k))·|X(l,k)|²
ρ_s(k) = α·ρ_s(k) + (1−α)·M(l,k)·|X(l,k)|²
where ρ_v(k), ρ_s(k) and ρ_h(k) respectively denote the power spectral densities of the background noise, the time-frequency-masked speech, and the harmonic loss. ρ_h(k) can be understood as the average harmonic loss over a period of time, so using ρ_h(k) for harmonic recovery gives better robustness. α is a smoothing factor between adjacent frames with a value between 0 and 1: if it is too small, the power spectral density estimate fluctuates too much and is unstable; if it is too high, the energy estimate is too stationary and the ability to model non-stationary signals is reduced. The invention preferably sets α = 0.95, which balances stability against the ability to model non-stationary noise.
The output of step S204 is used in subsequent steps to construct masking values that can restore the harmonic structure.
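The recursive PSD updates of S204 can be sketched as follows. The ρ_v and ρ_s recursions follow the formulas above; the ρ_h recursion is not reproduced in the text, so feeding it the squared magnitude of the corrected harmonic-loss estimate is an assumption:

```python
import numpy as np

ALPHA = 0.95  # the patent's preferred smoothing factor

def update_psd(rho_v, rho_s, rho_h, X, M, Yc, alpha=ALPHA):
    """One-frame recursive power-spectral-density update (step S204).

    rho_v: background-noise PSD; rho_s: PSD of the time-frequency-masked
    speech; rho_h: PSD of the harmonic loss. X and M are the current
    frame's spectrum and mask; Yc is the corrected harmonic-loss
    estimate (its use in the rho_h recursion is assumed).
    """
    rho_v = alpha * rho_v + (1 - alpha) * (1 - M) * np.abs(X) ** 2
    rho_s = alpha * rho_s + (1 - alpha) * M * np.abs(X) ** 2
    rho_h = alpha * rho_h + (1 - alpha) * np.abs(Yc) ** 2
    return rho_v, rho_s, rho_h

# usage: with a constant input the recursions converge to the
# instantaneous per-frame values
rho_v = np.zeros(1)
rho_s = np.zeros(1)
rho_h = np.zeros(1)
Xf = np.ones(1)
Mf = np.full(1, 0.5)
Ycf = np.zeros(1)
for _ in range(200):
    rho_v, rho_s, rho_h = update_psd(rho_v, rho_s, rho_h, Xf, Mf, Ycf)
```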
S3: and estimating a time-frequency masking value according to the power spectral density.
Specifically, for each frequency band k, a time-frequency masking value G (l, k) is estimated, and the estimation process is as follows:
where max(·,·) takes the larger of its two arguments. In step S3, the harmonic structure is enhanced, and the computed time-frequency masking value is used in the subsequent step to obtain the speech spectrum estimate.
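The exact formula for G(l,k) is not reproduced in the text; the sketch below encodes one plausible reading consistent with the description, a Wiener-style gain built from the three PSDs and floored by the deep-learning mask via max(·,·). This construction is an assumption, not the patent's verbatim formula:

```python
import numpy as np

def harmonic_gain(M, rho_v, rho_s, rho_h, eps=1e-12):
    """Time-frequency masking value G(l, k) for step S3 (assumed form).

    A Wiener-style gain from the speech and harmonic-loss PSDs is
    floored by the deep-learning mask M(l, k), so bands where the
    recovered harmonic energy is strong get their gain raised.
    """
    wiener = (rho_s + rho_h) / (rho_s + rho_h + rho_v + eps)
    return np.maximum(M, wiener)

# usage: the mask alone gives 0.1, but the recovered harmonic PSD
# lifts the gain in that band
G = harmonic_gain(np.array([0.1]),
                  rho_v=np.array([1.0]),
                  rho_s=np.array([0.1]),
                  rho_h=np.array([0.9]))
```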
S4: and according to the estimated time-frequency masking value, acquiring frequency domain estimation of the target voice, and further acquiring time domain estimation of the target voice.
Ŷ(l,k) = G(l,k)·X(l,k)
and then, obtaining a target voice time domain estimation through inverse Fourier transform, wherein the process is as follows:
through the step S4, the time domain estimated signal can be directly converted into a voltage signal through digital-to-analog conversion, and the enhanced voice signal is played by a speaker.
Through the steps S1-S4, signal time-frequency decomposition, harmonic loss estimation, time-frequency masking calculation and target voice synthesis can be realized, and finally, the voice communication quality is improved.
As shown in fig. 3, an embodiment of the present invention is a speech enhancement post-processing apparatus based on harmonic structure prediction, and includes a signal decomposition module 1, a loss estimation module 2, a masking calculation module 3, and a speech synthesis module 4.
And the signal decomposition module 1 is used for carrying out short-time Fourier transform on the voice signal of the microphone to obtain a time-frequency domain expression.
The signal decomposition module 1 can also be used to acquire the microphone speech signal and an echo reference signal, where the acquired speech signal is as follows: let x(n) denote the original time-domain signal picked up by the microphone element in real time, where n is the time index.
Specifically, the method of performing the short-time fourier transform is as follows:
carrying out short-time Fourier transform on the time domain signal x (n) to obtain a time-frequency domain expression:
where N is the frame length, N = 512; w(n) is a Hamming window of length 512, where n is the time index, so w(n) is the window value at each index n; l is the time-frame index, in units of frames; k is the frequency-band index, where a frequency band refers to the signal component at a certain frequency; j is the imaginary unit; and X(l,k) is the spectrum of the k-th frequency band in the l-th frame of the microphone speech signal.
The hamming window function used in the present invention is shown in fig. 2.
The time domain signal of the voice signal of the microphone can be converted into a time-frequency domain signal by the signal decomposition module 1.
The loss estimation module 2 is configured to perform harmonic loss estimation on the time-frequency domain signal, estimate the lost harmonic components, and obtain the estimated power spectral densities. The loss estimation module 2 comprises a masking calculation module, a harmonic loss calculation module, a loss estimation correction module and a power spectral density calculation module.
Specifically, the masking calculation module is configured to calculate the masked time-domain signal using the time-frequency masking value M(l,k) estimated by deep learning; the specific calculation process is as follows:
The time-domain signal output by the masking calculation module has a certain harmonic loss: the energy in some bands of the periodic harmonic structure (the fundamental frequency and its multiples) is significantly attenuated, but most of the harmonic structure is preserved and can be used subsequently to compute the lost harmonic structure.
Unlike the signal decomposition module 1, the inverse Fourier transform in the masking calculation module does not require a window function.
The harmonic loss calculation module is used for performing half-wave rectification on the masked time-domain signal and then a Fourier transform to obtain an estimate of the harmonic loss; the specific calculation process is as follows:
where sign(·) denotes the half-wave rectification operation.
The harmonic loss calculation module smooths adjacent harmonic bands by half-wave rectification followed by a Fourier transform and can partially recover damaged harmonics. The obtained harmonic loss estimate is somewhat underestimated and is corrected subsequently.
The loss estimation and correction module is used for carrying out harmonic loss estimation and correction, and the correction process is as follows:
the harmonic loss estimated in the harmonic loss calculation module has certain error, and the loss estimation correction module can correct the estimated result.
The principle of the correction method adopted by the loss estimation correction module is that a time-frequency masking method is adopted, and the masking value M (l, k) is always smaller than 1. Therefore, if estimation of harmonic lossThe value of (c) is less than its original value X (l, k), indicating that the harmonics are not recovered, and the original estimate is still used as the masking value. The results of the loss estimation correction module can be used to subsequently update the power spectral density and update the time-frequency mask estimate.
The power spectral density calculation module is used for estimating, for each frequency band k, the power spectral densities using a uniform smoothing factor α; the power spectral densities comprise the power spectral density of the background noise, of the time-frequency-masked speech, and of the harmonic loss.
The power spectral density estimation process is as follows:
ρ_v(k) = α·ρ_v(k) + (1−α)·(1−M(l,k))·|X(l,k)|²
ρ_s(k) = α·ρ_s(k) + (1−α)·M(l,k)·|X(l,k)|²
where ρ_v(k), ρ_s(k) and ρ_h(k) respectively denote the power spectral densities of the background noise, the time-frequency-masked speech, and the harmonic loss. ρ_h(k) can be understood as the average harmonic loss over a period of time, so using ρ_h(k) for harmonic recovery gives better robustness. α is a smoothing factor between adjacent frames with a value between 0 and 1. The invention preferably sets α = 0.95: if the value is too small, the power spectral density estimate fluctuates too much and is unstable; if it is too high, the energy estimate is too stationary and the ability to model non-stationary signals is reduced.
The masking calculation module 3 is used for estimating the time-frequency masking value from the power spectral densities.
Specifically, for each frequency band k, the time-frequency masking value G (l, k) is estimated, and the estimation process is as follows:
where max(·,·) takes the larger of its two arguments. In the masking calculation module 3, the harmonic structure is enhanced, and the computed time-frequency masking value is used subsequently to obtain the speech spectrum estimate.
And the voice synthesis module 4 is configured to obtain a frequency domain estimation of the target voice according to the estimated time-frequency masking value, and further obtain a time domain estimation of the target voice.
Ŷ(l,k) = G(l,k)·X(l,k)
and then, obtaining a target voice time domain estimation through inverse Fourier transform, wherein the process is as follows:
through the speech synthesis module 4, the time domain estimated signal can be directly converted into a voltage signal through digital-to-analog conversion, and the enhanced speech signal is played by a loudspeaker.
The speech enhancement post-processing device based on harmonic structure prediction can realize signal time-frequency decomposition, harmonic loss estimation, time-frequency masking calculation and target speech synthesis, and finally improve the speech communication quality.
The foregoing descriptions of specific exemplary embodiments of the present invention have been presented for purposes of illustration and description. It is not intended to limit the invention to the precise form disclosed, and obviously many modifications and variations are possible in light of the above teaching. The exemplary embodiments were chosen and described in order to explain certain principles of the invention and its practical application to enable one skilled in the art to make and use various exemplary embodiments of the invention and various alternatives and modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims and their equivalents.
Claims (10)
1. A speech enhancement post-processing method based on harmonic structure prediction, characterized by comprising the following steps:
s1: carrying out a short-time Fourier transform on a speech signal from a microphone to obtain a time-frequency domain representation;
s2: carrying out harmonic loss estimation and correction on the time-frequency domain signal to obtain estimated power spectral densities;
s3: estimating a time-frequency masking value according to the power spectral densities;
s4: according to the estimated time-frequency masking value, acquiring a frequency-domain estimate of the target speech, and further acquiring a time-domain estimate of the target speech.
2. The method for speech enhancement post-processing based on harmonic structure prediction according to claim 1, characterized in that step S1 is preceded by the step of: acquiring a speech signal x(n) from a microphone;
the short-time Fourier transform of the speech signal x(n) of the microphone in step S1 is as follows:
3. The harmonic structure prediction-based speech enhancement post-processing method according to claim 2, characterized in that step S2 specifically comprises the following steps:
s201: calculating the masked time-domain signal using the time-frequency masking value M(l, k) estimated by deep learning; the specific calculation process is as follows:
s202: applying half-wave rectification to the masked time-domain signal and then carrying out a Fourier transform to obtain an estimate of the harmonic loss; the specific calculation formula is as follows:
s203: correcting the harmonic loss estimate; the correction process is as follows:
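Steps S201–S202 can be sketched on a single spectral frame as below. The single-frame treatment is a simplification of the frame-by-frame STFT processing, and the unity mask used in testing is an assumption; the rectify-then-retransform step follows the harmonic-regeneration idea the claim describes.

```python
import numpy as np

def harmonic_loss_estimate(X_frame, M_frame):
    """Sketch of steps S201-S202 on one spectral frame.

    Applies the learned mask, returns to the time domain, half-wave
    rectifies, and re-transforms to estimate the harmonic loss.
    """
    s = np.fft.irfft(M_frame * X_frame)  # S201: masked time-domain signal
    s_hwr = np.maximum(s, 0.0)           # half-wave rectification regenerates harmonics
    return np.fft.rfft(s_hwr)            # S202: harmonic loss spectrum estimate
```

Half-wave rectification is a memoryless nonlinearity that re-creates energy at integer multiples of the fundamental, which is why it serves as a predictor of the harmonic structure suppressed by the mask.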
4. The method for speech enhancement post-processing based on harmonic structure prediction according to claim 3, characterized by further comprising: s204: for each frequency band k, estimating the power spectral densities using a uniform smoothing factor α; wherein the power spectral densities comprise the power spectral density of the background noise, the power spectral density of the time-frequency masked speech, and the power spectral density of the harmonic loss;
the power spectral density estimation process is as follows:
ρv(k) = αρv(k) + (1 − α)(1 − M(l, k))|X(l, k)|²
ρs(k) = αρs(k) + (1 − α)M(l, k)|X(l, k)|²
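The recursive smoothing of step S204 can be sketched as below. The α value is illustrative, and the update for the harmonic-loss density is not shown in this text, so only the two recursions quoted above are mirrored.

```python
import numpy as np

ALPHA = 0.9  # uniform smoothing factor (illustrative value, not from the patent)

def update_psd(rho, frame_power, alpha=ALPHA):
    # rho(k) <- alpha * rho(k) + (1 - alpha) * frame_power(k)
    return alpha * rho + (1.0 - alpha) * frame_power

# per frame l, following the two recursions quoted in the claim:
#   rho_v = update_psd(rho_v, (1 - M) * np.abs(X) ** 2)  # background noise
#   rho_s = update_psd(rho_s, M * np.abs(X) ** 2)        # masked speech
```

First-order recursive averaging like this tracks slowly varying spectral statistics while suppressing frame-to-frame fluctuations, which is why one shared α suffices for all three densities.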
5. The method for speech enhancement post-processing based on harmonic structure prediction according to claim 4, characterized in that in step S4, the frequency-domain estimate of the target speech is acquired as follows:
Ŷ(l, k) = G(l, k)X(l, k),
and the target speech time-domain estimate is then obtained through the inverse Fourier transform, as follows:
6. A speech enhancement post-processing device based on harmonic structure prediction, characterized by comprising a signal decomposition module, a loss estimation module, a masking calculation module, and a speech synthesis module:
the signal decomposition module is used for carrying out a short-time Fourier transform on the speech signal of the microphone to obtain a time-frequency domain representation;
the loss estimation module is used for carrying out harmonic loss estimation and correction on the time-frequency domain signal to obtain estimated power spectral densities; the loss estimation module comprises a masking calculation module, a harmonic loss calculation module, a loss estimation correction module, and a power spectral density calculation module;
the masking calculation module is used for estimating a time-frequency masking value according to the power spectral densities;
and the speech synthesis module is used for acquiring the frequency-domain estimate of the target speech according to the estimated time-frequency masking value, so as to obtain the time-domain estimate of the target speech.
7. The harmonic structure prediction-based speech enhancement post-processing device according to claim 6, characterized in that the signal decomposition module is further configured to acquire a speech signal x(n) from a microphone;
in the signal decomposition module, the process of performing the short-time Fourier transform on the speech signal x(n) of the microphone is as follows:
8. The harmonic structure prediction-based speech enhancement post-processing device according to claim 7, characterized in that the masking calculation module is configured to calculate the masked time-domain signal using the time-frequency masking value M(l, k) estimated by deep learning; the specific calculation process is as follows:
the harmonic loss calculation module is used for applying half-wave rectification to the masked time-domain signal and then carrying out a Fourier transform to obtain an estimate of the harmonic loss; the specific calculation formula is as follows:
the loss estimation correction module is used for correcting the harmonic loss estimate; the correction process is as follows:
the power spectral density calculating module is used for estimating the power spectral density of each frequency band k by adopting a unified smoothing factor alpha; wherein the power spectral density comprises the power spectral density of background noise, the power spectral density of time-frequency masked voice and the power spectral density of harmonic loss,
the power spectral density estimation process is as follows:
ρv(k)=αρv(k)+(1-α)(1-M(l,k))|X(l,k)|2
ρs(k)=αρs(k)+(1-α)M(l,k)|X(l,k)|2
10. The harmonic structure prediction-based speech enhancement post-processing device according to claim 9, characterized in that in the speech synthesis module, the frequency-domain estimate of the target speech is acquired as follows:
Ŷ(l, k) = G(l, k)X(l, k),
and the target speech time-domain estimate is then obtained through the inverse Fourier transform, as follows:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210049231.6A CN114360560A (en) | 2022-01-17 | 2022-01-17 | Speech enhancement post-processing method and device based on harmonic structure prediction |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114360560A true CN114360560A (en) | 2022-04-15 |
Family
ID=81092145
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114360560A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115862656A (en) * | 2023-02-03 | 2023-03-28 | 中国科学院自动化研究所 | Method, device, equipment and storage medium for enhancing bone-conduction microphone voice |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||