CN112562710A - Stepped voice enhancement method based on deep learning - Google Patents

Stepped voice enhancement method based on deep learning

Info

Publication number
CN112562710A
Authority
CN
China
Prior art keywords
signal
perception
voice
order
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011359400.3A
Other languages
Chinese (zh)
Other versions
CN112562710B (en)
Inventor
胡静
万里
王建荣
刘李
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202011359400.3A priority Critical patent/CN112562710B/en
Publication of CN112562710A publication Critical patent/CN112562710A/en
Application granted granted Critical
Publication of CN112562710B publication Critical patent/CN112562710B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02087 Noise filtering the noise being separate speech, e.g. cocktail party

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention relates to a stepped voice enhancement method based on deep learning, comprising the following steps: a) establishing a first-order framework fitted to machine perception; b) establishing a second-order framework fitted to human auditory perception. The design of the invention is scientific and reasonable, and the effectiveness of the stepped method makes it possible to obtain both speech suited to machine perception and speech suited to human-ear perception.

Description

Stepped voice enhancement method based on deep learning
Technical Field
The invention belongs to the field of voice signal processing, relates to a signal processing technology, and particularly relates to a stepped voice enhancement method based on deep learning.
Background
Speech is the most basic and important way for humans to exchange information; it is a distinctively human ability, and sound is the most important carrier by which speech conveys information. Human-computer interaction is an important step in future development, and speech, as the front face of human-computer interaction, is becoming ever more important, so for an excellent human-computer interaction system an excellent speech front end plays an irreplaceable role. In real life, the environments people occupy are filled with noise at all times, so a better background-noise suppression scheme benefits both human-computer and human-human interaction.
Automatic Speech Recognition (ASR) converts speech signals into text information using signal processing and pattern recognition techniques. More and more products now employ speech recognition, which is to a large extent the cornerstone of future human-computer interaction. Over the past decade and more, and particularly since 2009, the rapid development of computing power, deep learning and other machine learning methods, and big-data technology has laid the foundation for deploying speech recognition, and it has achieved remarkable results in the market. Large internet companies have rushed to launch speech services, and speech recognition on mobile phones in particular has become part of people's daily lives.
Speech productization is highly valued both at home and abroad. Abroad, Siri, developed by Apple Inc., is a voice assistant embedded in Apple devices; users can query real-time information through Siri, which invokes the system's own applications. In addition, Google Assistant, Cortana (Microsoft's voice assistant), and Alexa (Amazon's voice assistant) are among the more mature foreign voice systems. Although domestic research on speech recognition did not start as early as it did abroad, the domestic voice market has now reached unprecedented prosperity, its main product carrier being the widely used smart speaker, such as the Tmall Genie. However, because real scenes are highly variable, speech recognition performs poorly in many of them: the smart speakers people use today cannot make accurate judgments in noisy scenes, vehicle-mounted speech recognition systems cannot reliably recognize speech when the vehicle is moving fast, and cochlear implants cannot adapt well to noise in noisy scenes, so hearing-impaired people cannot reliably judge what others are saying by hearing alone. Improving the quality of the speech signal in complex scenes can therefore greatly improve both speech recognition accuracy and human auditory perception.
The key technology for improving speech quality is enhancing the perceptual strength of the target signal, namely speech enhancement, which in essence suppresses the noise signal and thereby strengthens the target speech content. On the one hand, speech enhancement makes the target signal in a noisy speech signal more perceptible, improving the auditory strength and intelligibility of the noisy signal; on the other hand, it improves the robustness of the enhanced speech in other applications. Deep learning has become increasingly prominent and, compared with traditional signal processing methods, brings very large gains in the signal field; in particular, it shows a markedly better speech separation effect in suppressing non-stationary noise.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a stepped voice enhancement method based on deep learning, which can obtain an enhanced speech signal suitable for machine perception and, on that basis, further obtain a speech signal suitable for human auditory perception.
The technical problem to be solved by the invention is realized by the following technical scheme:
a stepped voice enhancement method based on deep learning is characterized in that: the method comprises the following steps:
a) establishment of a first-order fit machine-aware framework:
step S1, framing and windowing the voice signal, then performing short-time Fourier transform, and transforming the time sequence signal into a frequency domain signal;
step S2, designing a heuristic input mode, concentrating the input in a low-frequency area, enhancing the perception intensity of the low-frequency area, and specifically expressing the following formula:
Y=(Yi×signal(Yi,…,Yn),Yi,…,Yn)
Yirepresenting a single frame special frame in an original voice signal, and representing signal activation;
and step S3, the activated features are sent into a first-order network, the network structure is formed by a residual DNN, and all the activation functions adopt ReLU to perform high-dimensional spatial noise separation.
Step S4, mapping the result of step S3 into 1025-dimensional enhanced linear characteristics, and then combining the characteristics with the phase of the original noisy speech to obtain a speech signal which is first-order enhanced and suitable for machine perception;
b) the second-order fit human ear auditory perception framework is established:
step S5, utilizing the result of the first-order frame and adopting a stacked residual DNN network structure to further perform signal enhancement conversion, and converting the signal into a Mel spectrum filtering signal designed based on human ear perception;
step S6, inputting the result obtained by the previous layer into two layers of BGRUs, so that the information between the frames is correlated, the independence between the frames is eliminated, and simultaneously, the frequency domain information is converted to time frequency linkage;
and step S7, obtaining the Mel characteristic spectrum value of (n,80) through linear transformation, wherein the characteristic spectrum is the characteristic spectrum value after second-order enhancement, and synthesizing the voice signal suitable for human ear listening perception by using a WaveRNN vocoder.
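As an illustration of the heuristic input of step S2, the following is a minimal sketch. The publication does not specify the activation signal(·) or the size of the low-frequency region, so the element-wise sigmoid gate and the n_low cut-off below are assumptions:

```python
import numpy as np

def heuristic_input(Y, n_low=256):
    """Emphasize the low-frequency region of magnitude features.

    Y: (frames, 1025) magnitude spectrogram from the STFT front end.
    n_low: number of low-frequency bins to re-weight (assumed value).
    """
    low = Y[:, :n_low]
    # Stand-in for signal(.): an element-wise sigmoid gate is assumed here.
    gate = 1.0 / (1.0 + np.exp(-low))
    # Mirror Y = (Y_i x signal(...), Y_i, ..., Y_n): concatenate the
    # gated low band with the original full-band frame.
    return np.concatenate([low * gate, Y], axis=1)
```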
Furthermore, the Hamming window applied in step S1 has a length of 46.5 ms (1024 samples), adjacent frames overlap by 75% (768 samples), and a 2048-point Fourier transform maps the time-domain speech to 1025 dimensions.
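A minimal front-end sketch consistent with these parameters, using librosa; the 22.05 kHz sampling rate is an assumption inferred from 1024 samples spanning roughly 46.5 ms:

```python
import librosa
import numpy as np

def stft_features(wav_path):
    # ~46.5 ms window at 22.05 kHz is 1024 samples (sampling rate assumed).
    y, _ = librosa.load(wav_path, sr=22050)
    # 75% overlap -> hop of 256 samples; 2048-point FFT -> 1025 bins.
    spec = librosa.stft(y, n_fft=2048, win_length=1024,
                        hop_length=256, window="hamming")
    return np.abs(spec).T, np.angle(spec).T  # (frames, 1025) each

def reconstruct(enhanced_mag, noisy_phase):
    # Step S4: combine the enhanced magnitude with the original noisy phase.
    spec = enhanced_mag.T * np.exp(1j * noisy_phase.T)
    return librosa.istft(spec, win_length=1024, hop_length=256,
                         window="hamming")
```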
Moreover, the residual DNN structure consists of 4 DNN layers of 1024 dimensions each, where each layer takes input both from the immediately preceding layer and, through a short connection, from the layer above that, the residual being expressed as follows:
F(x) = H(x) + x
where x is the input vector, H(x) is the output of the deeper layers, and F(x) is the input after the short-connection transform across the different layers.
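A minimal PyTorch sketch of such a residual DNN, with 4 layers of 1024 dimensions and ReLU activations as described; the input/output projections and their dimensions are assumptions:

```python
import torch
import torch.nn as nn

class ResidualDNN(nn.Module):
    """4-layer DNN where each layer adds a short connection: F(x) = H(x) + x."""

    def __init__(self, in_dim=1025, hidden=1024, out_dim=1025, n_layers=4):
        super().__init__()
        self.proj_in = nn.Linear(in_dim, hidden)
        self.layers = nn.ModuleList(
            nn.Linear(hidden, hidden) for _ in range(n_layers))
        self.proj_out = nn.Linear(hidden, out_dim)

    def forward(self, x):
        h = torch.relu(self.proj_in(x))
        for layer in self.layers:
            h = torch.relu(layer(h)) + h  # residual: deep output plus input
        return self.proj_out(h)
```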
The invention has the advantages and beneficial effects that:
1. The invention adopts a stepped speech enhancement network with a stepped training scheme: the first-order enhancement framework based on machine listening is trained first; the first-order model is then frozen while the second-order enhancement framework based on human auditory perception is trained; finally, the two frameworks are trained jointly to fine-tune the model. The experimental results show that the stepped speech enhancement method not only achieves a good machine-oriented noise reduction effect at the first order but also yields good human-oriented speech at the second order.
2. Compared with other methods, the proposed method obtains better results, and the results of subjective human evaluation are clearly better than those of machine perception, which demonstrates the effectiveness of the proposed stepped method: both speech suited to machine perception and speech suited to human-ear perception can be obtained.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a histogram of subjective perception scores for the present invention.
Detailed Description
The present invention is further illustrated by the following specific examples, which are intended to be illustrative rather than limiting and do not restrict the scope of the invention.
A stepped voice enhancement method based on deep learning, characterized by comprising the following steps:
a) establishing the first-order framework fitted to machine perception:
step S1, framing and windowing the speech signal, then applying a short-time Fourier transform to convert the time-domain signal into a frequency-domain signal;
step S2, designing a heuristic input scheme that concentrates the input on the low-frequency region and strengthens the perceptual weight of that region, expressed by the following formula:
Y = (Y_i × signal(Y_i, …, Y_n), Y_i, …, Y_n)
where Y_i denotes a single-frame feature of the original speech signal and signal(·) denotes the activation applied to it;
step S3, feeding the activated features into the first-order network, whose structure is a residual DNN with ReLU as every activation function, to separate the noise in a high-dimensional space;
step S4, mapping the result of step S3 to 1025-dimensional enhanced linear features and combining them with the phase of the original noisy speech to obtain a first-order enhanced speech signal suited to machine perception;
b) establishing the second-order framework fitted to human auditory perception (a sketch of this second-order stage follows this list):
step S5, using the result of the first-order framework and a stacked residual DNN structure to further enhance and transform the signal, converting it into a Mel-spectrum filtered signal designed around human auditory perception;
step S6, feeding the result of the previous layer into two layers of BGRUs so that information across frames becomes correlated, inter-frame independence is removed, and the frequency-domain information is coupled with temporal context;
step S7, obtaining an (n, 80) Mel feature spectrum through a linear transform, this being the second-order enhanced feature spectrum, and synthesizing a speech signal suited to human auditory perception with a WaveRNN vocoder.
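A minimal sketch of the second-order stage of steps S5 to S7, reusing the ResidualDNN sketched earlier; the hidden sizes are assumptions, while the two BGRU layers and the 80-dimensional Mel output follow the text (the WaveRNN vocoder that turns the Mel spectrum into a waveform is a separate pretrained model and is not shown):

```python
import torch.nn as nn

class SecondOrderMelStage(nn.Module):
    """Steps S5-S7: stacked residual DNN -> 2-layer BGRU -> 80-dim Mel."""

    def __init__(self, in_dim=1025, hidden=1024, gru_hidden=512, n_mels=80):
        super().__init__()
        self.residual = ResidualDNN(in_dim=in_dim, hidden=hidden,
                                    out_dim=hidden)
        self.bgru = nn.GRU(hidden, gru_hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.to_mel = nn.Linear(2 * gru_hidden, n_mels)

    def forward(self, first_order_out):
        # first_order_out: (batch, frames, 1025) enhanced linear features.
        h = self.residual(first_order_out)  # step S5: further enhancement
        h, _ = self.bgru(h)                 # step S6: link across frames
        return self.to_mel(h)               # step S7: (batch, n, 80) Mel
```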
Furthermore, the Hamming window applied in step S1 has a length of 46.5 ms (1024 samples), adjacent frames overlap by 75% (768 samples), and a 2048-point Fourier transform maps the time-domain speech to 1025 dimensions.
Moreover, the residual DNN structure consists of 4 DNN layers of 1024 dimensions each, where each layer takes input both from the immediately preceding layer and, through a short connection, from the layer above that, the residual being expressed as follows:
F(x) = H(x) + x
where x is the input vector, H(x) is the output of the deeper layers, and F(x) is the input after the short-connection transform across the different layers.
To better verify the effectiveness of the method, it was compared with the DNN-based speech enhancement method proposed by Xun et al., the CNN-based speech enhancement model proposed by Tomas et al., the LSTM-based speech enhancement model proposed by Chen et al., and the GRU-based speech enhancement model proposed by Bai et al.
The experimental evaluation indexes are the Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI), and Mean Opinion Score (MOS).
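PESQ and STOI can be computed with the open-source pesq and pystoi packages; a minimal sketch, assuming time-aligned 16 kHz mono reference and enhanced files (MOS, being a listening-test score, has no closed-form implementation):

```python
import soundfile as sf
from pesq import pesq
from pystoi import stoi

def evaluate(ref_path, enhanced_path):
    ref, sr = sf.read(ref_path)
    enh, _ = sf.read(enhanced_path)
    n = min(len(ref), len(enh))            # trim to equal length
    ref, enh = ref[:n], enh[:n]
    pesq_score = pesq(sr, ref, enh, 'wb')  # 'wb' mode requires 16 kHz input
    stoi_score = stoi(ref, enh, sr, extended=False)  # value in [0, 1]
    return pesq_score, stoi_score
```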
Tables 2 and 3 show the parameters of the first-order network and the second-order network, respectively, and Table 1 shows the experimental verification results. All three tables are presented as images in the original publication and are not reproduced here.
As can be seen from the results, the method of the invention obtains better results than the other methods, and the results of subjective human evaluation are clearly better than those of machine perception. This demonstrates the effectiveness of the stepped method: speech suited to machine perception and speech suited to human-ear perception can both be obtained.
Although the embodiments of the present invention and the accompanying drawings are disclosed for illustrative purposes, those skilled in the art will appreciate that: various substitutions, changes and modifications are possible without departing from the spirit and scope of the invention and the appended claims, and therefore the scope of the invention is not limited to the disclosure of the embodiments and the accompanying drawings.

Claims (3)

1. A stepped voice enhancement method based on deep learning, characterized by comprising the following steps:
a) establishing the first-order framework fitted to machine perception:
step S1, framing and windowing the speech signal, then applying a short-time Fourier transform to convert the time-domain signal into a frequency-domain signal;
step S2, designing a heuristic input scheme that concentrates the input on the low-frequency region and strengthens the perceptual weight of that region, expressed by the following formula:
Y = (Y_i × signal(Y_i, …, Y_n), Y_i, …, Y_n)
where Y_i denotes a single-frame feature of the original speech signal and signal(·) denotes the activation applied to it;
step S3, feeding the activated features into the first-order network, whose structure is a residual DNN with ReLU as every activation function, to separate the noise in a high-dimensional space;
step S4, mapping the result of step S3 to 1025-dimensional enhanced linear features and combining them with the phase of the original noisy speech to obtain a first-order enhanced speech signal suited to machine perception;
b) establishing the second-order framework fitted to human auditory perception:
step S5, using the result of the first-order framework and a stacked residual DNN structure to further enhance and transform the signal, converting it into a Mel-spectrum filtered signal designed around human auditory perception;
step S6, feeding the result of the previous layer into two layers of BGRUs so that information across frames becomes correlated, inter-frame independence is removed, and the frequency-domain information is coupled with temporal context;
step S7, obtaining an (n, 80) Mel feature spectrum through a linear transform, this being the second-order enhanced feature spectrum, and synthesizing a speech signal suited to human auditory perception with a WaveRNN vocoder.
2. The deep learning based stepped speech enhancement method of claim 1, wherein: the Hamming window applied in step S1 has a length of 46.5 ms (1024 samples), adjacent frames overlap by 75% (768 samples), and a 2048-point Fourier transform maps the time-domain speech to 1025 dimensions.
3. The deep learning based stepped speech enhancement method of claim 1, wherein: the residual DNN structure consists of 4 DNN layers of 1024 dimensions each, where each layer takes input both from the immediately preceding layer and, through a short connection, from the layer above that, the residual being expressed as follows:
F(x) = H(x) + x
where x is the input vector, H(x) is the output of the deeper layers, and F(x) is the input after the short-connection transform across the different layers.
CN202011359400.3A 2020-11-27 2020-11-27 Stepped voice enhancement method based on deep learning Active CN112562710B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011359400.3A CN112562710B (en) 2020-11-27 2020-11-27 Stepped voice enhancement method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011359400.3A CN112562710B (en) 2020-11-27 2020-11-27 Stepped voice enhancement method based on deep learning

Publications (2)

Publication Number Publication Date
CN112562710A true CN112562710A (en) 2021-03-26
CN112562710B CN112562710B (en) 2022-09-30

Family

ID=75046395

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011359400.3A Active CN112562710B (en) 2020-11-27 2020-11-27 Stepped voice enhancement method based on deep learning

Country Status (1)

Country Link
CN (1) CN112562710B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1737906A (en) * 2004-03-23 2006-02-22 哈曼贝克自动系统-威美科公司 Isolating speech signals utilizing neural networks
US20200312343A1 (en) * 2019-04-01 2020-10-01 Qnap Systems, Inc. Speech enhancement method and system
CN110648684A (en) * 2019-07-02 2020-01-03 中国人民解放军陆军工程大学 Bone conduction voice enhancement waveform generation method based on WaveNet

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ranya Aloufi et al., "Privacy-preserving Voice Analysis via Disentangled Representations", arXiv:2007.15064v2 *

Also Published As

Publication number Publication date
CN112562710B (en) 2022-09-30

Similar Documents

Publication Publication Date Title
CN110085245B (en) Voice definition enhancing method based on acoustic feature conversion
Latif et al. Adversarial machine learning and speech emotion recognition: Utilizing generative adversarial networks for robustness
CN111916101B (en) Deep learning noise reduction method and system fusing bone vibration sensor and double-microphone signals
Xiang et al. A nested u-net with self-attention and dense connectivity for monaural speech enhancement
CN105206271A (en) Intelligent equipment voice wake-up method and system for realizing method
CN109215665A (en) A kind of method for recognizing sound-groove based on 3D convolutional neural networks
CN109215674A (en) Real-time voice Enhancement Method
CN105741849A (en) Voice enhancement method for fusing phase estimation and human ear hearing characteristics in digital hearing aid
CN105448302B (en) A kind of the speech reverberation removing method and system of environment self-adaption
CN108564965B (en) Anti-noise voice recognition system
CN103761974B (en) Cochlear implant
CN115602165B (en) Digital employee intelligent system based on financial system
CN106024010A (en) Speech signal dynamic characteristic extraction method based on formant curves
CN110136709A (en) Audio recognition method and video conferencing system based on speech recognition
CN107274887A (en) Speaker's Further Feature Extraction method based on fusion feature MGFCC
Yuliani et al. Speech enhancement using deep learning methods: A review
CN104778948B (en) A kind of anti-noise audio recognition method based on bending cepstrum feature
CN109599094A (en) The method of sound beauty and emotion modification
CN113035203A (en) Control method for dynamically changing voice response style
CN112562710B (en) Stepped voice enhancement method based on deep learning
WO2012159370A1 (en) Voice enhancement method and device
Li et al. A mapping model of spectral tilt in normal-to-Lombard speech conversion for intelligibility enhancement
CN114550701A (en) Deep neural network-based Chinese electronic larynx voice conversion device and method
CN112614502B (en) Echo cancellation method based on double LSTM neural network
Kashani et al. Speech Enhancement via Deep Spectrum Image Translation Network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant