CN114067825A - Comfort noise generation method based on time-frequency masking estimation and application thereof - Google Patents
- Publication number
- CN114067825A (application CN202111360253.6A)
- Authority
- CN
- China
- Prior art keywords
- time
- frequency
- noise
- estimation
- comfort noise
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0224—Processing in the time domain
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/21—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M3/00—Automatic or semi-automatic exchanges
- H04M3/42—Systems providing special services or facilities to subscribers
- H04M3/56—Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
- H04M3/568—Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L2021/02161—Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02166—Microphone arrays; Beamforming
Abstract
The invention relates to the technical field of noise processing, and in particular discloses a comfort noise generation method based on time-frequency masking estimation and an application thereof. The method comprises the following steps: S1, converting the time-domain signal x(n) picked up by the microphone element into a time-frequency-domain signal to obtain the spectrum X(l, k) of the microphone signal in the l-th frame and the k-th frequency band; S2, estimating the power spectral density of the comfort noise; S3, generating comfort noise of the corresponding energy; S4, synthesizing the target speech. Because the stationary noise component is estimated from time-frequency masking information obtained by deep learning, the scheme avoids accumulating speech energy into the stationary noise component, which would otherwise produce excessive comfort noise; on the other hand, comfort noise is added selectively per time-frequency unit, so no noise is introduced into speech-dominated time-frequency units.
Description
Technical Field
The present invention relates to the field of noise processing technologies, and in particular, to a comfort noise generation method based on time-frequency masking estimation and an application thereof.
Background
Noise suppression and speech enhancement are key techniques for improving speech communication quality in conferencing systems and conferencing equipment. The traditional signal-processing approach tracks the noise power spectral density and the speech power spectral density in the signal, constructs a masking value between 0 and 1 in the frequency domain based on Wiener filtering, and suppresses background noise by applying the mask to the microphone signal. The drawback of this signal-processing technique is that it cannot effectively handle non-stationary noise in the environment, and speech distortion becomes excessive under strong noise interference. At present, time-frequency masking estimation based on deep learning is another common approach to noise suppression; its main idea is to estimate the time-frequency masking value directly from the mixed signal by training on pairs of noisy and clean speech data. The deep-learning approach handles non-stationary noise better, but it suffers from distortion caused by over-suppression of speech.
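The Wiener-style masking described above can be sketched as follows; this is a generic textbook form given for illustration, not a formula taken from this patent:

```python
import numpy as np

def wiener_mask(speech_psd, noise_psd, eps=1e-12):
    """Classical Wiener-style gain in [0, 1]: the ratio of estimated
    speech power to total power, applied per frequency bin."""
    return speech_psd / (speech_psd + noise_psd + eps)

# A bin with speech power 3 and noise power 1 gets a gain of 0.75.
gain = wiener_mask(np.array([3.0]), np.array([1.0]))
```

Multiplying each microphone-spectrum bin by this gain attenuates noise-dominated bins toward 0 while leaving speech-dominated bins near 1.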
Therefore, in summary, the main disadvantages of the prior art are:
The signal-processing approach of tracking the stationary component of the background noise produces excessive comfort noise in scenes with high environmental noise.
Existing comfort noise generation methods add noise to all time-frequency units indiscriminately, so a certain amount of noise is added even to speech-dominated time-frequency regions.
How to significantly improve perceived listening quality while generating a moderate amount of comfort noise is therefore an urgent and difficult problem.
The existing scheme estimates the stationary component of the environmental noise and then generates white noise of equal energy to add to the spectrum, so as to reduce the impact of speech distortion on listening perception.
The information disclosed in this background section is only for enhancement of understanding of the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.
Disclosure of Invention
The invention aims to provide a comfort noise generation method based on time-frequency masking estimation and an application thereof, which can improve communication quality when applied to noise suppression and speech enhancement in voice conference systems and the like.
In order to achieve the above object, the present invention provides a comfort noise generation method based on time-frequency masking estimation, comprising the steps of:
S1, converting the time-domain signal x(n) picked up by the microphone element into a time-frequency-domain signal to obtain the spectrum X(l, k) of the microphone signal in the l-th frame and the k-th frequency band;
S2, estimating the power spectral density of the comfort noise;
S3, generating comfort noise of the corresponding energy;
S4, synthesizing the target speech.
In a specific implementation scenario, in the first step, the microphone array signal is first subjected to signal decomposition: a Fourier transform converts the time-domain signal into a frequency-domain signal to obtain its spectrum, which facilitates subsequent noise processing. In the second step, comfort noise power spectral density estimation is performed on the frequency-domain signal; this step comprises, in order, noise power spectral density estimation, stationary noise power spectral density estimation, and comfort noise energy estimation, where the noise power spectral density estimation uses time-frequency masking information from the prior art. In the third step, the comfort noise spectrum is generated; it bridges estimation and synthesis and facilitates further speech processing and analysis. In the fourth step, the target speech is estimated: after a frequency-domain estimate of the target speech is obtained, an inverse Fourier transform yields the target time-domain signal, i.e., the target speech signal, which is then output.
Alternatively, the calculation formula of the frequency spectrum X (l, k) in S1 is as follows:
where N is the frame length 512, w(n) is a Hamming window of length 512, n is the time index, l is the time-frame index, k is the frequency index, and X(l, k) is the spectrum of the microphone signal in the l-th frame and the k-th frequency band.
Optionally, the S2 specifically includes:
S21, obtaining a time-frequency masking value M(l, k), and calculating the environmental-noise power spectral density ρ_v(k) for each frequency band k:
where |·| denotes the modulus of a complex number, and α is the smoothing factor between adjacent frames, with a value range between 0 and 1;
S22, estimating the stationary-noise power spectral density ρ_min(k) for each frequency band k:
the stationary-noise power spectral density represents the minimum component of the tracked noise, i.e., the minimum value of the noise component in the signal; α is the same smoothing factor as in step S21, and γ is the stationary-noise control factor, with a value less than 1;
s23, calculating comfort noise energy ζ:
where K has a value equal to half the frame length.
Optionally, the smoothing factor α is 0.95.
Optionally, the stationary noise control factor γ is 0.08.
Optionally, the S3 specifically includes:
generating a comfort noise power spectrum v (l, k):
where σ (n) is a white noise sequence with energy of 1 and length of 512.
Optionally, the S4 specifically includes:
s41, calculating the frequency domain estimation of the target voice according to the following formula:
where X(l, k) is the spectrum, M(l, k) is the time-frequency masking value, and υ(l, k) is the comfort-noise power spectrum;
s42, performing inverse Fourier transform to obtain target voice time domain estimation:
where w(n) is the Hamming window of frame length 512.
The invention also provides a comfort noise generation system based on time-frequency masking estimation, the system being used to implement the comfort noise generation method based on time-frequency masking estimation, and comprising:
the signal decomposition module, used for converting the time-domain signal x(n) picked up by the microphone element into a time-frequency-domain signal and obtaining the spectrum X(l, k) of the microphone signal in the l-th frame and the k-th frequency band;
a comfort noise power spectral density estimation module for estimating a comfort noise power spectral density;
the comfortable noise generating module is used for generating comfortable noise with corresponding energy;
and the target voice synthesis module is used for synthesizing the target voice.
Optionally, the comfort noise power spectral density estimation module includes a noise power spectral density estimation module, a stationary noise power spectral density estimation module, and a comfort noise energy estimation module, which sequentially process the signal.
The invention also provides an electronic device comprising a memory and a processor, wherein the processor is used for realizing the steps of the comfort noise generation method based on the time-frequency masking estimation when executing the computer management program stored in the memory.
Compared with the prior art, the comfort noise generation method based on time-frequency masking estimation and the application thereof provided by the invention comprise the following steps: S1, converting the time-domain signal x(n) picked up by the microphone element into a time-frequency-domain signal to obtain the spectrum X(l, k) of the microphone signal in the l-th frame and the k-th frequency band; S2, estimating the power spectral density of the comfort noise; S3, generating comfort noise of the corresponding energy; S4, synthesizing the target speech. Because the stationary noise component is estimated from time-frequency masking information obtained by deep learning, the scheme avoids accumulating speech energy into the stationary noise component, which would otherwise produce excessive comfort noise; on the other hand, comfort noise is added selectively per time-frequency unit, so no noise is introduced into speech-dominated time-frequency units.
Drawings
Fig. 1 is a functional block diagram of a comfort noise generation system based on time-frequency masking estimation according to the present invention.
Detailed Description
The following detailed description of the present invention is provided in conjunction with the accompanying drawings, but it should be understood that the scope of the present invention is not limited to the specific embodiments.
Throughout the specification and claims, unless explicitly stated otherwise, the word "comprise", or variations such as "comprises" or "comprising", will be understood to imply the inclusion of a stated element or component but not the exclusion of any other element or component.
Example one
A comfort noise generation method based on time-frequency masking estimation according to the preferred embodiment of the present invention comprises the following steps:
S1, converting the time-domain signal x(n) picked up by the microphone element into a time-frequency-domain signal to obtain the spectrum X(l, k) of the microphone signal in the l-th frame and the k-th frequency band;
S2, estimating the power spectral density of the comfort noise;
S3, generating comfort noise of the corresponding energy;
S4, synthesizing the target speech.
In a preferred embodiment, the spectrum X (l, k) in S1 is calculated as follows:
obtaining a time-frequency domain representation by performing a short-time fourier transform on the time-domain signal x (n):
where N is the frame length 512, w(n) is a Hamming window of length 512, n is the time index, l is the time-frame index, and k is the frequency index. X(l, k) is the spectrum of the microphone signal in the l-th frame and the k-th frequency band. A frequency band refers to the signal component corresponding to a certain frequency. The Hamming window function gives the window value for each sample time n; the Hamming window is standard prior art and is not described further herein.
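The framing and transform described above can be sketched in Python. The 50% frame shift and the use of a real FFT are illustrative assumptions; the text fixes only the frame length N = 512 and the Hamming window:

```python
import numpy as np

def frame_spectrum(x, l, N=512, hop=256):
    """Spectrum X(l, k) of the l-th frame: apply a length-N Hamming
    window w(n) to the frame, then take the FFT.  The 50% hop is an
    assumed analysis parameter, not stated in the text."""
    w = np.hamming(N)                      # analysis window w(n)
    frame = x[l * hop : l * hop + N] * w   # windowed frame
    return np.fft.rfft(frame)              # X(l, k), k = 0 .. N/2

# A 1 kHz tone at a 16 kHz sampling rate lands exactly in bin
# k = 1000 / (16000 / 512) = 32.
fs = 16000
n = np.arange(2 * 512)
x = np.sin(2 * np.pi * 1000 * n / fs)
X0 = frame_spectrum(x, 0)
```

The real FFT returns N/2 + 1 = 257 bins per frame, matching the claim that K equals half the frame length (plus the DC bin).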
Example two
This embodiment is the same as the comfort noise generation method based on time-frequency masking estimation in the first embodiment, with the following differences:
S2 specifically includes: assuming the time-frequency masking value estimated by deep learning is M(l, k), the environmental-noise power spectral density ρ_v(k) is calculated for each frequency band k:
where |·| denotes the modulus of a complex number, and α is the smoothing factor between adjacent frames, with a value range between 0 and 1. The invention preferably takes α = 0.95: if the value is too small, the power spectral density estimate fluctuates unstably; if it is too large, the energy estimate is over-smoothed and the ability to model non-stationary noise is reduced. M(l, k) is a prior-art masking value obtained by deep-learning estimation, with values between 0 and 1. When M(l, k) < 0.5, the time-frequency unit is regarded as dominated by environmental noise and the noise power spectral density is updated; otherwise the unit is regarded as speech-dominated and the update of the noise power spectral density is suspended. The result of this step is used to update the stationary-noise power spectral density in (3) below.
The stationary-noise power spectral density ρ_min(k) is estimated for each frequency band k:
This power spectral density represents the minimum component of the tracked noise, i.e., the minimum value of the noise component in the signal, obtained without directly smoothing the microphone signal. Here α is the same smoothing factor as in step S21. γ is the stationary-noise control factor; since stationary noise is only part of the noise, the stationary-noise energy is less than the total noise energy, i.e., γ < 1. The control factor is taken as γ = 0.08, which produces an appropriate comfort-noise energy and avoids excessive comfort noise. This step is used to calculate the comfort-noise energy.
Calculating comfort noise energy ζ:
this step calculates the physical meaning that the value ρ for all the frequency bands kmin(k) Averaging, where K represents the number of all bands. The value of K is equal to half the frame length, i.e. 256. This step is used for comfort noise energy in the subsequent steps.
Step S3 specifically includes: generating a comfort noise power spectrum v (l, k):
where σ (n) is a white noise sequence of energy 1 and length 512. This step is used as the next step to calculate the final speech spectrum.
Step S4 specifically includes: obtaining a frequency-domain estimate of the target speech from the comfort-noise power spectrum obtained in step (5):
Namely, when the energy after time-frequency masking is less than the comfort-noise energy, comfort noise is added to that time-frequency unit, avoiding distortion caused by local energy loss. Conversely, when the energy after time-frequency masking is greater than the comfort-noise energy, no noise is added, avoiding excessive added noise energy.
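The per-bin rule above can be sketched as below. Whether the comfort noise replaces or is summed with the masked value in low-energy bins is ambiguous in the text; this sketch uses replacement, which is one plausible reading:

```python
import numpy as np

def apply_comfort_noise(X_l, M_l, v_l):
    """Where the masked magnitude |M(l,k) X(l,k)| falls below the
    comfort-noise magnitude |v(l,k)|, take the comfort noise;
    otherwise keep the masked signal unchanged."""
    masked = M_l * X_l
    return np.where(np.abs(masked) < np.abs(v_l), v_l, masked)
```

Speech-dominated bins (large M·X) pass through untouched, so no noise is introduced where speech dominates, while heavily suppressed bins are floored at the comfort-noise level.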
Performing inverse Fourier transform to obtain target voice time domain estimation:
where w(n) is the Hamming window of frame length 512.
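The inverse transform with the synthesis window can be sketched as follows; the Hamming synthesis window is taken from the text, while the overlap-add across frames needed to reconstruct the full signal is omitted here:

```python
import numpy as np

def synthesize_frame(S_l, N=512):
    """Inverse FFT of the estimated frame spectrum, weighted by a
    Hamming synthesis window; successive frames would be combined
    by overlap-add (not shown) to form the output waveform."""
    return np.fft.irfft(S_l, N) * np.hamming(N)
```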
The time-domain estimated signal can be converted directly into a voltage signal by digital-to-analog conversion and played through a loudspeaker as enhanced speech.
As shown in fig. 1, an embodiment of the present invention further provides a comfort noise generation system based on time-frequency masking estimation, where the system is configured to implement the comfort noise generation method based on time-frequency masking estimation according to the two previous embodiments, where the system includes:
the signal decomposition module, used for converting the time-domain signal x(n) picked up by the microphone element into a time-frequency-domain signal and obtaining the spectrum X(l, k) of the microphone signal in the l-th frame and the k-th frequency band;
a comfort noise power spectral density estimation module for estimating a comfort noise power spectral density;
the comfortable noise generating module is used for generating comfortable noise with corresponding energy;
and the target voice synthesis module is used for synthesizing the target voice.
Specifically, the comfort noise power spectral density estimation module includes a noise power spectral density estimation module, a stationary noise power spectral density estimation module, and a comfort noise energy estimation module, which sequentially process signals.
An embodiment of the present invention further provides an electronic device, which includes a memory and a processor, where the processor is configured to implement the steps of the comfort noise generation method based on time-frequency masking estimation as described in the two previous embodiments when executing a computer management program stored in the memory.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing descriptions of specific exemplary embodiments of the present invention have been presented for purposes of illustration and description. It is not intended to limit the invention to the precise form disclosed, and obviously many modifications and variations are possible in light of the above teaching. The exemplary embodiments were chosen and described in order to explain certain principles of the invention and its practical application to enable one skilled in the art to make and use various exemplary embodiments of the invention and various alternatives and modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims and their equivalents.
Claims (10)
1. A comfort noise generation method based on time-frequency masking estimation is characterized by comprising the following steps:
S1, converting the time-domain signal x(n) picked up by the microphone element into a time-frequency-domain signal to obtain the spectrum X(l, k) of the microphone signal in the l-th frame and the k-th frequency band;
S2, estimating the power spectral density of the comfort noise;
S3, generating comfort noise of the corresponding energy;
S4, synthesizing the target speech.
2. The method for comfort noise generation based on time-frequency masking estimation according to claim 1, characterized in that the calculation formula of the frequency spectrum X (l, k) in S1 is as follows:
where N is the frame length 512, w(n) is a Hamming window of length 512, n is the time index, l is the time-frame index, k is the frequency index, and X(l, k) is the spectrum of the microphone signal in the l-th frame and the k-th frequency band.
3. The method for comfort noise generation based on time-frequency masking estimation as claimed in claim 2, wherein said S2 specifically includes:
S21, obtaining a time-frequency masking value M(l, k), and calculating the environmental-noise power spectral density ρ_v(k) for each frequency band k:
where |·| denotes the modulus of a complex number, and α is the smoothing factor between adjacent frames, with a value range between 0 and 1;
S22, estimating the stationary-noise power spectral density ρ_min(k) for each frequency band k:
the stationary-noise power spectral density represents the minimum component of the tracked noise, i.e., the minimum value of the noise component in the signal; α is the same smoothing factor as in step S21, and γ is the stationary-noise control factor, with a value less than 1;
s23, calculating comfort noise energy ζ:
where K has a value equal to half the frame length.
4. A comfort noise generation method based on time-frequency masking estimation according to claim 3, characterized in that the smoothing factor α is 0.95.
5. A comfort noise generation method based on time-frequency masking estimation according to claim 3, characterized in that the stationary noise control factor γ is 0.08.
7. The method for comfort noise generation based on time-frequency masking estimation as claimed in claim 6, wherein said S4 specifically includes:
s41, calculating the frequency domain estimation of the target voice according to the following formula:
where X(l, k) is the spectrum, M(l, k) is the time-frequency masking value, and υ(l, k) is the comfort-noise power spectrum;
s42, performing inverse Fourier transform to obtain target voice time domain estimation:
where w(n) is the Hamming window of frame length 512.
8. A comfort noise generation system based on time-frequency masking estimation, characterized in that the system is configured to implement the comfort noise generation method based on time-frequency masking estimation according to any one of claims 1 to 7, and comprises:
the signal decomposition module, used for converting the time-domain signal x(n) picked up by the microphone element into a time-frequency-domain signal and obtaining the spectrum X(l, k) of the microphone signal in the l-th frame and the k-th frequency band;
a comfort noise power spectral density estimation module for estimating a comfort noise power spectral density;
the comfortable noise generating module is used for generating comfortable noise with corresponding energy;
and the target voice synthesis module is used for synthesizing the target voice.
9. The time-frequency mask estimation based comfort noise generation system according to claim 8, wherein the comfort noise power spectral density estimation module comprises a noise power spectral density estimation module, a stationary noise power spectral density estimation module, and a comfort noise energy estimation module, which sequentially process signals.
10. An electronic device, comprising a memory and a processor, the processor being configured to implement the steps of the comfort noise generation method based on time-frequency masking estimation according to any of claims 1-7 when executing a computer management program stored in the memory.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111360253.6A CN114067825A (en) | 2021-11-17 | 2021-11-17 | Comfort noise generation method based on time-frequency masking estimation and application thereof |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114067825A true CN114067825A (en) | 2022-02-18 |
Family
ID=80273356
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111360253.6A Pending CN114067825A (en) | 2021-11-17 | 2021-11-17 | Comfort noise generation method based on time-frequency masking estimation and application thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114067825A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023226592A1 (en) * | 2022-05-25 | 2023-11-30 | 青岛海尔科技有限公司 | Noise signal processing method and apparatus, and storage medium and electronic apparatus |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101258057B1 (en) * | 2011-11-16 | 2013-04-24 | 한국과학기술원 | Apparatus and method for auditory masking-based adjusting the amplitude of phone ringing sounds under acoustic noise environments |
CN113030862A (en) * | 2021-03-12 | 2021-06-25 | 中国科学院声学研究所 | Multi-channel speech enhancement method and device |
CN113160845A (en) * | 2021-03-29 | 2021-07-23 | 南京理工大学 | Speech enhancement algorithm based on speech existence probability and auditory masking effect |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||