CN112562710A - Stepped voice enhancement method based on deep learning - Google Patents
- Publication number
- CN112562710A CN112562710A CN202011359400.3A CN202011359400A CN112562710A CN 112562710 A CN112562710 A CN 112562710A CN 202011359400 A CN202011359400 A CN 202011359400A CN 112562710 A CN112562710 A CN 112562710A
- Authority
- CN
- China
- Prior art keywords
- signal
- perception
- voice
- order
- speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L2021/02087—Noise filtering the noise being separate speech, e.g. cocktail party
Abstract
The invention relates to a stepped voice enhancement method based on deep learning, comprising the following steps: a) establishing a first-order framework fitted to machine perception; b) establishing a second-order framework fitted to human auditory perception. The design is scientific and reasonable; owing to the effectiveness of the stepped approach, speech suited to machine perception and speech suited to human ear perception can both be obtained.
Description
Technical Field
The invention belongs to the field of voice signal processing, relates to a signal processing technology, and particularly relates to a stepped voice enhancement method based on deep learning.
Background
Speech is the most basic and important way for humans to exchange information; it is a uniquely human ability, and sound is its most important carrier. Human-computer interaction is a key step in future development, and speech, as the gateway to human-computer interaction, is becoming ever more important; for an excellent human-computer interaction system, an excellent speech front end therefore plays an irreplaceable role. In real life, the environments people occupy are filled with noise at all times, so a better background-noise suppression scheme benefits both human-computer and human-human interaction.
Automatic Speech Recognition (ASR) is the conversion of speech signals into text using signal processing and pattern recognition techniques. More and more products now use speech recognition technology, which is to a large extent the cornerstone of future human-computer interaction. Over the past decade or so, and particularly since 2009, advances in computing power, deep learning and machine learning, together with the rapid development of big-data technology, have laid the foundation for the commercial deployment of speech recognition, with remarkable achievements in the market. Large internet companies have raced to launch speech services, and speech recognition on mobile phones in particular has become part of people's daily lives.
Speech productization is highly valued at home and abroad. Abroad, Siri, developed by Apple Inc., is a voice assistant embedded in every Apple device; the user can query real-time information through Siri, which invokes the system's own applications. Google Assistant, Cortana (Microsoft's voice assistant) and Alexa (Amazon's voice assistant) are likewise mature foreign voice systems. Although domestic research on speech recognition did not start as early as abroad, the domestic voice market has now reached unprecedented prosperity, its main product carrier being the widely used smart speaker, such as the Tmall Genie and the Xiao Ai speaker. However, because real scenes vary greatly, speech recognition performs poorly in many of them: a smart speaker cannot make accurate judgments in a noisy scene; a vehicle-mounted speech recognition system cannot recognize speech well at high vehicle speeds; and a cochlear implant cannot adapt well to noise in noisy scenes, leaving hearing-impaired users unable to judge what others are saying from hearing alone. Improving the speech-signal quality in complex scenes can therefore greatly improve both the speech recognition accuracy and the level of human auditory perception and judgment.
The key to improving speech quality is enhancing the perceptual strength of the target signal, that is, speech enhancement, which essentially suppresses the expression of the noise signal and thereby enhances the target speech content. On one hand, speech enhancement lets the target signal within noisy speech express itself more strongly, improving the auditory perceptual strength and intelligibility of the noisy signal; on the other hand, it improves the robustness of the enhanced speech in downstream applications. Deep learning has grown increasingly popular and brings very large gains in the signal field compared with traditional signal processing methods; in particular, it shows a markedly better speech-separation effect than traditional methods in suppressing non-stationary noise.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a stepped voice enhancement method based on deep learning, which obtains an enhanced speech signal suited to machine perception and, building on that result, a speech signal suited to human auditory perception.
The technical problem to be solved by the invention is realized by the following technical scheme:
a stepped voice enhancement method based on deep learning is characterized in that: the method comprises the following steps:
a) establishing the first-order framework fitted to machine perception:
step S1, framing and windowing the speech signal, then applying a short-time Fourier transform to convert the time-domain signal into a frequency-domain signal;
step S2, designing a heuristic input mode that concentrates the input in the low-frequency region to enhance the perceptual intensity there, expressed by the following formula:
Y = (Yi × signal(Yi, …, Yn), Yi, …, Yn)
where Yi denotes a single-frame feature frame in the original speech signal and signal(·) denotes the signal activation;
step S3, feeding the activated features into the first-order network, whose structure consists of a residual DNN with ReLU as every activation function, to perform noise separation in the high-dimensional space;
step S4, mapping the result of step S3 into 1025-dimensional enhanced linear features, then combining them with the phase of the original noisy speech to obtain a first-order-enhanced speech signal suited to machine perception;
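A minimal NumPy sketch of steps S1 and S2 follows. The 1 s input length and the sigmoid form of the `signal(·)` gate are assumptions for illustration, since the text names the activation but does not define it:

```python
import numpy as np

def frame_and_window(x, frame_len=1024, hop=256):
    """Step S1: split the signal into 75%-overlapping frames and apply a Hamming window."""
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])
    return frames * np.hamming(frame_len)

def stft_features(frames, n_fft=2048):
    """Step S1 (cont.): a 2048-point FFT maps each frame to 1025 linear-frequency bins."""
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=-1))  # shape (n_frames, 1025)

def heuristic_input(Y):
    """Step S2 (sketch): concatenate a gated copy of the spectrum with the raw
    spectrum, emphasising the low-frequency region. The sigmoid gate here is an
    assumption; the patent only names a 'signal' activation."""
    gate = 1.0 / (1.0 + np.exp(-Y))                # assumed sigmoid-style activation
    return np.concatenate([Y * gate, Y], axis=-1)  # shape (n_frames, 2050)

x = np.random.randn(16000)          # 1 s of synthetic "speech"
frames = frame_and_window(x)
Y = stft_features(frames)
features = heuristic_input(Y)
print(frames.shape[1], Y.shape[1], features.shape[1])  # 1024 1025 2050
```

The concatenation keeps the raw spectrum alongside the gated copy, so the first-order network sees both the emphasised and the original features.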
b) establishing the second-order framework fitted to human auditory perception:
step S5, using the result of the first-order framework, further converting and enhancing the signal with a stacked residual DNN structure into a Mel-spectrum filtered signal designed around human ear perception;
step S6, feeding the output of the previous layer into two BGRU layers so that information across frames becomes correlated, eliminating inter-frame independence while coupling the frequency-domain information across time;
step S7, obtaining the (n, 80) Mel feature spectrum through a linear transformation, this being the second-order-enhanced feature spectrum, and synthesizing the speech signal suited to human auditory perception with a WaveRNN vocoder.
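The conversion to a Mel-filtered representation (step S5) and the (n, 80) feature shape (step S7) can be illustrated with a hand-built triangular Mel filterbank. The 22.05 kHz sample rate is an assumption; the patent does not state one:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=80, n_fft=2048, sr=22050):
    """Triangular Mel filters mapping 1025 linear bins to 80 Mel bands,
    yielding the (n, 80) feature spectrum of step S7. sr is an assumed rate."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):       # rising edge of triangle i
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):      # falling edge of triangle i
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

linear = np.abs(np.random.randn(59, 1025))  # first-order linear spectrum, n frames x 1025
mel = linear @ mel_filterbank().T           # -> (59, 80), ready for the BGRU layers
print(mel.shape)  # (59, 80)
```

In practice a library routine such as `librosa.filters.mel` would replace the hand-built bank; the sketch only fixes the shapes the two orders exchange.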
Furthermore, the Hamming window applied in step S1 has a window length of 46.5 ms, containing 1024 samples; adjacent frames overlap by 75%, or 768 samples; and a 2048-point Fourier transform maps the time-series speech information to 1025 dimensions.
Moreover, the residual DNN structure is a 4-layer DNN in which each layer takes input both from the preceding layer and, via a shortcut, from an earlier layer; each layer is 1024-dimensional, and the residual is expressed as follows:
F(x) = H(x) + x
where x is the input vector, H(x) is the output of the deeper layers, and F(x) is the output after the shortcut transform across layers.
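The 4-layer residual DNN with F(x) = H(x) + x and ReLU activations can be sketched as follows; the random weight initialization is a placeholder, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(v):
    return np.maximum(v, 0.0)

class ResidualDNNLayer:
    """One 1024-dimensional layer of the residual DNN: F(x) = H(x) + x,
    where H is a ReLU-activated affine map (placeholder weights)."""
    def __init__(self, dim=1024):
        self.W = rng.normal(scale=0.02, size=(dim, dim))
        self.b = np.zeros(dim)

    def __call__(self, x):
        return relu(x @ self.W + self.b) + x  # the skip connection adds the input back

layers = [ResidualDNNLayer() for _ in range(4)]  # the 4-layer structure described above
x = rng.normal(size=(1, 1024))
out = x
for layer in layers:
    out = layer(out)
print(out.shape)  # (1, 1024)
```

The skip connection means each layer only has to learn a correction to its input, which is what makes the deeper first-order network trainable.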
The invention has the advantages and beneficial effects that:
1. The invention adopts a stepped speech-enhancement network together with a stepped training scheme: the first-order enhancement framework, oriented to machine hearing, is trained first; the first-order model is then frozen while the second-order framework, oriented to human ear perception, is trained; finally the two frameworks are combined for joint training and fine-tuning. Experimental results show that this stepped speech enhancement not only yields a good machine-hearing noise-reduction effect at the first order, but also produces good human-hearing-oriented speech at the second order.
2. Compared with other methods, the proposed method obtains better results, and the results of subjective human evaluation are clearly better than those of machine perception, demonstrating the effectiveness of the stepped approach: speech suited to machine perception and speech suited to human ear perception can both be obtained.
Drawings
FIG. 1 is a flow chart of the steps of the present invention;
fig. 2 is a histogram of subjective perception scores for the present invention.
Detailed Description
The present invention is further illustrated by the following specific examples, which are illustrative rather than limiting and do not restrict the scope of the invention.
A stepped voice enhancement method based on deep learning is characterized in that: the method comprises the following steps:
a) establishing the first-order framework fitted to machine perception:
step S1, framing and windowing the speech signal, then applying a short-time Fourier transform to convert the time-domain signal into a frequency-domain signal;
step S2, designing a heuristic input mode that concentrates the input in the low-frequency region to enhance the perceptual intensity there, expressed by the following formula:
Y = (Yi × signal(Yi, …, Yn), Yi, …, Yn)
where Yi denotes a single-frame feature frame in the original speech signal and signal(·) denotes the signal activation;
step S3, feeding the activated features into the first-order network, whose structure consists of a residual DNN with ReLU as every activation function, to perform noise separation in the high-dimensional space;
step S4, mapping the result of step S3 into 1025-dimensional enhanced linear features, then combining them with the phase of the original noisy speech to obtain a first-order-enhanced speech signal suited to machine perception;
b) establishing the second-order framework fitted to human auditory perception:
step S5, using the result of the first-order framework, further converting and enhancing the signal with a stacked residual DNN structure into a Mel-spectrum filtered signal designed around human ear perception;
step S6, feeding the output of the previous layer into two BGRU layers so that information across frames becomes correlated, eliminating inter-frame independence while coupling the frequency-domain information across time;
step S7, obtaining the (n, 80) Mel feature spectrum through a linear transformation, this being the second-order-enhanced feature spectrum, and synthesizing the speech signal suited to human auditory perception with a WaveRNN vocoder.
Furthermore, the Hamming window applied in step S1 has a window length of 46.5 ms, containing 1024 samples; adjacent frames overlap by 75%, or 768 samples; and a 2048-point Fourier transform maps the time-series speech information to 1025 dimensions.
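The window parameters above are mutually consistent, as a quick check shows; a sample rate near 22 kHz is implied, since 1024 samples at 22.05 kHz is about 46.4 ms:

```python
# Consistency check of the STFT parameters in step S1.
frame_len = 1024           # samples per frame (~46.5 ms at the implied rate)
overlap = 0.75             # 75% frame-to-frame coverage
hop = int(frame_len * (1 - overlap))
n_fft = 2048               # 2048-point Fourier transform
n_bins = n_fft // 2 + 1    # one-sided spectrum dimensionality
print(hop, int(frame_len * overlap), n_bins)  # 256 768 1025
```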
Moreover, the residual DNN structure is a 4-layer DNN in which each layer takes input both from the preceding layer and, via a shortcut, from an earlier layer; each layer is 1024-dimensional, and the residual is expressed as follows:
F(x) = H(x) + x
where x is the input vector, H(x) is the output of the deeper layers, and F(x) is the output after the shortcut transform across layers.
To verify the effectiveness of the method, it was compared with the DNN-based speech enhancement method proposed by Xun et al., the CNN-based speech enhancement model proposed by Tomas et al., the LSTM-based speech enhancement model proposed by Chen et al., and the GRU-based speech enhancement model proposed by Bai et al.
The experimental evaluation indexes are Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI), and Mean Opinion Score (MOS).
Tables 2 and 3 show the parameters of the first-order network and the second-order network, respectively, and table 1 shows the experimental verification results.
Table 1 experimental verification results
TABLE 2 first order network parameters
TABLE 3 second-order network parameters
As the results show, the proposed method obtains better results than the other methods, and the results of subjective human evaluation are clearly better than those of machine perception. This demonstrates the effectiveness of the stepped approach: speech suited to machine perception and speech suited to human ear perception can both be obtained.
Although the embodiments of the present invention and the accompanying drawings are disclosed for illustrative purposes, those skilled in the art will appreciate that: various substitutions, changes and modifications are possible without departing from the spirit and scope of the invention and the appended claims, and therefore the scope of the invention is not limited to the disclosure of the embodiments and the accompanying drawings.
Claims (3)
1. A stepped voice enhancement method based on deep learning is characterized in that: the method comprises the following steps:
a) establishing the first-order framework fitted to machine perception:
step S1, framing and windowing the speech signal, then applying a short-time Fourier transform to convert the time-domain signal into a frequency-domain signal;
step S2, designing a heuristic input mode that concentrates the input in the low-frequency region to enhance the perceptual intensity there, expressed by the following formula:
Y = (Yi × signal(Yi, …, Yn), Yi, …, Yn)
where Yi denotes a single-frame feature frame in the original speech signal and signal(·) denotes the signal activation;
step S3, feeding the activated features into the first-order network, whose structure consists of a residual DNN with ReLU as every activation function, to perform noise separation in the high-dimensional space;
step S4, mapping the result of step S3 into 1025-dimensional enhanced linear features, then combining them with the phase of the original noisy speech to obtain a first-order-enhanced speech signal suited to machine perception;
b) establishing the second-order framework fitted to human auditory perception:
step S5, using the result of the first-order framework, further converting and enhancing the signal with a stacked residual DNN structure into a Mel-spectrum filtered signal designed around human ear perception;
step S6, feeding the output of the previous layer into two BGRU layers so that information across frames becomes correlated, eliminating inter-frame independence while coupling the frequency-domain information across time;
step S7, obtaining the (n, 80) Mel feature spectrum through a linear transformation, this being the second-order-enhanced feature spectrum, and synthesizing the speech signal suited to human auditory perception with a WaveRNN vocoder.
2. The deep learning based stepped speech enhancement method of claim 1, wherein: the Hamming window applied in step S1 has a window length of 46.5 ms, containing 1024 samples; adjacent frames overlap by 75%, or 768 samples; and a 2048-point Fourier transform maps the time-series speech information to 1025 dimensions.
3. The deep learning based stepped speech enhancement method of claim 1, wherein: the residual DNN structure is a 4-layer DNN in which each layer takes input both from the preceding layer and, via a shortcut, from an earlier layer; each layer is 1024-dimensional, and the residual is expressed as follows:
F(x) = H(x) + x
where x is the input vector, H(x) is the output of the deeper layers, and F(x) is the output after the shortcut transform across layers.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011359400.3A CN112562710B (en) | 2020-11-27 | 2020-11-27 | Stepped voice enhancement method based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112562710A true CN112562710A (en) | 2021-03-26 |
CN112562710B CN112562710B (en) | 2022-09-30 |
Family
ID=75046395
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011359400.3A Active CN112562710B (en) | 2020-11-27 | 2020-11-27 | Stepped voice enhancement method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112562710B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1737906A (en) * | 2004-03-23 | 2006-02-22 | 哈曼贝克自动系统-威美科公司 | Isolating speech signals utilizing neural networks |
CN110648684A (en) * | 2019-07-02 | 2020-01-03 | 中国人民解放军陆军工程大学 | Bone conduction voice enhancement waveform generation method based on WaveNet |
US20200312343A1 (en) * | 2019-04-01 | 2020-10-01 | Qnap Systems, Inc. | Speech enhancement method and system |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1737906A (en) * | 2004-03-23 | 2006-02-22 | 哈曼贝克自动系统-威美科公司 | Isolating speech signals utilizing neural networks |
US20200312343A1 (en) * | 2019-04-01 | 2020-10-01 | Qnap Systems, Inc. | Speech enhancement method and system |
CN110648684A (en) * | 2019-07-02 | 2020-01-03 | 中国人民解放军陆军工程大学 | Bone conduction voice enhancement waveform generation method based on WaveNet |
Non-Patent Citations (1)
Title |
---|
RANYA ALOUFI ET AL.: "Privacy-preserving Voice Analysis via Disentangled Representations", arXiv:2007.15064v2 *
Also Published As
Publication number | Publication date |
---|---|
CN112562710B (en) | 2022-09-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110085245B (en) | Voice definition enhancing method based on acoustic feature conversion | |
Latif et al. | Adversarial machine learning and speech emotion recognition: Utilizing generative adversarial networks for robustness | |
CN111916101B (en) | Deep learning noise reduction method and system fusing bone vibration sensor and double-microphone signals | |
Xiang et al. | A nested u-net with self-attention and dense connectivity for monaural speech enhancement | |
CN105206271A (en) | Intelligent equipment voice wake-up method and system for realizing method | |
CN109215665A (en) | A kind of method for recognizing sound-groove based on 3D convolutional neural networks | |
CN109215674A (en) | Real-time voice Enhancement Method | |
CN105741849A (en) | Voice enhancement method for fusing phase estimation and human ear hearing characteristics in digital hearing aid | |
CN105448302B (en) | A kind of the speech reverberation removing method and system of environment self-adaption | |
CN108564965B (en) | Anti-noise voice recognition system | |
CN103761974B (en) | Cochlear implant | |
CN115602165B (en) | Digital employee intelligent system based on financial system | |
CN106024010A (en) | Speech signal dynamic characteristic extraction method based on formant curves | |
CN110136709A (en) | Audio recognition method and video conferencing system based on speech recognition | |
CN107274887A (en) | Speaker's Further Feature Extraction method based on fusion feature MGFCC | |
Yuliani et al. | Speech enhancement using deep learning methods: A review | |
CN104778948B (en) | A kind of anti-noise audio recognition method based on bending cepstrum feature | |
CN109599094A (en) | The method of sound beauty and emotion modification | |
CN113035203A (en) | Control method for dynamically changing voice response style | |
CN112562710B (en) | Stepped voice enhancement method based on deep learning | |
WO2012159370A1 (en) | Voice enhancement method and device | |
Li et al. | A mapping model of spectral tilt in normal-to-Lombard speech conversion for intelligibility enhancement | |
CN114550701A (en) | Deep neural network-based Chinese electronic larynx voice conversion device and method | |
CN112614502B (en) | Echo cancellation method based on double LSTM neural network | |
Kashani et al. | Speech Enhancement via Deep Spectrum Image Translation Network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||