CN112562710A - Stepped voice enhancement method based on deep learning - Google Patents
- Publication number
- CN112562710A CN112562710A CN202011359400.3A CN202011359400A CN112562710A CN 112562710 A CN112562710 A CN 112562710A CN 202011359400 A CN202011359400 A CN 202011359400A CN 112562710 A CN112562710 A CN 112562710A
- Authority
- CN
- China
- Prior art keywords
- signal
- perception
- voice
- order
- speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L2021/02087—Noise filtering the noise being separate speech, e.g. cocktail party
Abstract
The invention relates to a stepped voice enhancement method based on deep learning, comprising the following steps: a) establishing a first-order framework fitted to machine perception; b) establishing a second-order framework fitted to human auditory perception. The design is scientific and reasonable; owing to the effectiveness of the stepped approach, speech suited to machine perception and speech suited to human ear perception can both be obtained.
Description
Technical Field
The invention belongs to the field of voice signal processing, relates to a signal processing technology, and particularly relates to a stepped voice enhancement method based on deep learning.
Background
Speech is the most basic and important way for humans to exchange information; it is a uniquely human ability, and sound is its most important carrier. Human-computer interaction is a key step in future development, and speech, as the gateway to human-computer interaction, is becoming ever more important; for an excellent human-computer interaction system, an excellent speech front end therefore plays an irreplaceable role. In real life, the environments people occupy are filled with noise at all times, so a better background-noise suppression scheme benefits both human-computer and human-human interaction.
Automatic Speech Recognition (ASR) is the conversion of speech signals into text using signal processing and pattern recognition techniques. More and more products now use speech recognition technology, which is to a large extent the cornerstone of future human-computer interaction. Over the past decade or so, and particularly since 2009, advances in computing power, deep learning and machine learning, together with the rapid development of big-data technology, have laid the foundation for the commercial deployment of speech recognition, with remarkable achievements in the market. Large internet companies have raced to launch speech services, and speech recognition on mobile phones in particular has become part of people's daily lives.
Speech productization is highly valued at home and abroad. Abroad, Siri, developed by Apple Inc., is a voice assistant embedded in every Apple device; the user can query real-time information through Siri, which invokes the system's own applications. Google Assistant, Cortana (Microsoft's voice assistant) and Alexa (Amazon's voice assistant) are likewise mature foreign voice systems. Although domestic research on speech recognition did not start as early as abroad, the domestic voice market has now reached unprecedented prosperity, its main product carrier being the widely used smart speaker, such as the Tmall Genie and the Xiao Ai speaker. However, because real scenes vary greatly, speech recognition performs poorly in many of them: a smart speaker cannot make accurate judgments in a noisy scene; a vehicle-mounted speech recognition system cannot recognize speech well at high vehicle speeds; and a cochlear implant cannot adapt well to noise in noisy scenes, leaving hearing-impaired users unable to judge what others are saying from hearing alone. Improving the speech-signal quality in complex scenes can therefore greatly improve both the speech recognition accuracy and the level of human auditory perception and judgment.
The key to improving speech quality is enhancing the perceptual strength of the target signal, that is, speech enhancement, which essentially suppresses the expression of the noise signal and thereby enhances the target speech content. On one hand, speech enhancement lets the target signal within noisy speech express itself more strongly, improving the auditory perceptual strength and intelligibility of the noisy signal; on the other hand, it improves the robustness of the enhanced speech in downstream applications. Deep learning has grown increasingly popular and brings very large gains in the signal field compared with traditional signal processing methods; in particular, it shows a markedly better speech-separation effect than traditional methods in suppressing non-stationary noise.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a stepped voice enhancement method based on deep learning, which obtains an enhanced speech signal suited to machine perception and, building on that result, a speech signal suited to human auditory perception.
The technical problem to be solved by the invention is realized by the following technical scheme:
a stepped voice enhancement method based on deep learning is characterized in that: the method comprises the following steps:
a) establishing the first-order framework fitted to machine perception:
step S1, framing and windowing the speech signal, then applying a short-time Fourier transform to convert the time-domain signal into a frequency-domain signal;
step S2, designing a heuristic input mode that concentrates the input in the low-frequency region to enhance the perceptual intensity there, expressed by the following formula:
Y = (Yi × signal(Yi, …, Yn), Yi, …, Yn)
where Yi denotes a single-frame feature frame in the original speech signal and signal(·) denotes the signal activation;
step S3, feeding the activated features into the first-order network, whose structure consists of a residual DNN with ReLU as every activation function, to perform noise separation in the high-dimensional space;
step S4, mapping the result of step S3 into 1025-dimensional enhanced linear features, then combining them with the phase of the original noisy speech to obtain a first-order-enhanced speech signal suited to machine perception;
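A minimal NumPy sketch of steps S1 and S2 follows. The 1 s input length and the sigmoid form of the `signal(·)` gate are assumptions for illustration, since the text names the activation but does not define it:

```python
import numpy as np

def frame_and_window(x, frame_len=1024, hop=256):
    """Step S1: split the signal into 75%-overlapping frames and apply a Hamming window."""
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])
    return frames * np.hamming(frame_len)

def stft_features(frames, n_fft=2048):
    """Step S1 (cont.): a 2048-point FFT maps each frame to 1025 linear-frequency bins."""
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=-1))  # shape (n_frames, 1025)

def heuristic_input(Y):
    """Step S2 (sketch): concatenate a gated copy of the spectrum with the raw
    spectrum, emphasising the low-frequency region. The sigmoid gate here is an
    assumption; the patent only names a 'signal' activation."""
    gate = 1.0 / (1.0 + np.exp(-Y))                # assumed sigmoid-style activation
    return np.concatenate([Y * gate, Y], axis=-1)  # shape (n_frames, 2050)

x = np.random.randn(16000)          # 1 s of synthetic "speech"
frames = frame_and_window(x)
Y = stft_features(frames)
features = heuristic_input(Y)
print(frames.shape[1], Y.shape[1], features.shape[1])  # 1024 1025 2050
```

The concatenation keeps the raw spectrum alongside the gated copy, so the first-order network sees both the emphasised and the original features.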
b) establishing the second-order framework fitted to human auditory perception:
step S5, using the result of the first-order framework, further converting and enhancing the signal with a stacked residual DNN structure into a Mel-spectrum filtered signal designed around human ear perception;
step S6, feeding the output of the previous layer into two BGRU layers so that information across frames becomes correlated, eliminating inter-frame independence while coupling the frequency-domain information across time;
step S7, obtaining the (n, 80) Mel feature spectrum through a linear transformation, this being the second-order-enhanced feature spectrum, and synthesizing the speech signal suited to human auditory perception with a WaveRNN vocoder.
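The conversion to a Mel-filtered representation (step S5) and the (n, 80) feature shape (step S7) can be illustrated with a hand-built triangular Mel filterbank. The 22.05 kHz sample rate is an assumption; the patent does not state one:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=80, n_fft=2048, sr=22050):
    """Triangular Mel filters mapping 1025 linear bins to 80 Mel bands,
    yielding the (n, 80) feature spectrum of step S7. sr is an assumed rate."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):       # rising edge of triangle i
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):      # falling edge of triangle i
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

linear = np.abs(np.random.randn(59, 1025))  # first-order linear spectrum, n frames x 1025
mel = linear @ mel_filterbank().T           # -> (59, 80), ready for the BGRU layers
print(mel.shape)  # (59, 80)
```

In practice a library routine such as `librosa.filters.mel` would replace the hand-built bank; the sketch only fixes the shapes the two orders exchange.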
Furthermore, the Hamming window applied in step S1 has a window length of 46.5 ms, containing 1024 samples; adjacent frames overlap by 75%, or 768 samples; and a 2048-point Fourier transform maps the time-series speech information to 1025 dimensions.
Moreover, the residual DNN structure is a 4-layer DNN in which each layer takes input both from the preceding layer and, via a shortcut, from an earlier layer; each layer is 1024-dimensional, and the residual is expressed as follows:
F(x) = H(x) + x
where x is the input vector, H(x) is the output of the deeper layers, and F(x) is the output after the shortcut transform across layers.
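The 4-layer residual DNN with F(x) = H(x) + x and ReLU activations can be sketched as follows; the random weight initialization is a placeholder, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(v):
    return np.maximum(v, 0.0)

class ResidualDNNLayer:
    """One 1024-dimensional layer of the residual DNN: F(x) = H(x) + x,
    where H is a ReLU-activated affine map (placeholder weights)."""
    def __init__(self, dim=1024):
        self.W = rng.normal(scale=0.02, size=(dim, dim))
        self.b = np.zeros(dim)

    def __call__(self, x):
        return relu(x @ self.W + self.b) + x  # the skip connection adds the input back

layers = [ResidualDNNLayer() for _ in range(4)]  # the 4-layer structure described above
x = rng.normal(size=(1, 1024))
out = x
for layer in layers:
    out = layer(out)
print(out.shape)  # (1, 1024)
```

The skip connection means each layer only has to learn a correction to its input, which is what makes the deeper first-order network trainable.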
The invention has the advantages and beneficial effects that:
1. The invention adopts a stepped speech-enhancement network together with a stepped training scheme: the first-order enhancement framework, oriented to machine hearing, is trained first; the first-order model is then frozen while the second-order framework, oriented to human ear perception, is trained; finally the two frameworks are combined for joint training and fine-tuning. Experimental results show that this stepped speech enhancement not only yields a good machine-hearing noise-reduction effect at the first order, but also produces good human-hearing-oriented speech at the second order.
2. Compared with other methods, the proposed method obtains better results, and the results of subjective human evaluation are clearly better than those of machine perception, demonstrating the effectiveness of the stepped approach: speech suited to machine perception and speech suited to human ear perception can both be obtained.
Drawings
FIG. 1 is a flow chart of the steps of the present invention;
fig. 2 is a histogram of subjective perception scores for the present invention.
Detailed Description
The present invention is further illustrated by the following specific examples, which are illustrative rather than limiting and do not restrict the scope of the invention.
A stepped voice enhancement method based on deep learning is characterized in that: the method comprises the following steps:
a) establishing the first-order framework fitted to machine perception:
step S1, framing and windowing the speech signal, then applying a short-time Fourier transform to convert the time-domain signal into a frequency-domain signal;
step S2, designing a heuristic input mode that concentrates the input in the low-frequency region to enhance the perceptual intensity there, expressed by the following formula:
Y = (Yi × signal(Yi, …, Yn), Yi, …, Yn)
where Yi denotes a single-frame feature frame in the original speech signal and signal(·) denotes the signal activation;
step S3, feeding the activated features into the first-order network, whose structure consists of a residual DNN with ReLU as every activation function, to perform noise separation in the high-dimensional space;
step S4, mapping the result of step S3 into 1025-dimensional enhanced linear features, then combining them with the phase of the original noisy speech to obtain a first-order-enhanced speech signal suited to machine perception;
b) establishing the second-order framework fitted to human auditory perception:
step S5, using the result of the first-order framework, further converting and enhancing the signal with a stacked residual DNN structure into a Mel-spectrum filtered signal designed around human ear perception;
step S6, feeding the output of the previous layer into two BGRU layers so that information across frames becomes correlated, eliminating inter-frame independence while coupling the frequency-domain information across time;
step S7, obtaining the (n, 80) Mel feature spectrum through a linear transformation, this being the second-order-enhanced feature spectrum, and synthesizing the speech signal suited to human auditory perception with a WaveRNN vocoder.
Furthermore, the Hamming window applied in step S1 has a window length of 46.5 ms, containing 1024 samples; adjacent frames overlap by 75%, or 768 samples; and a 2048-point Fourier transform maps the time-series speech information to 1025 dimensions.
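The window parameters above are mutually consistent, as a quick check shows; a sample rate near 22 kHz is implied, since 1024 samples at 22.05 kHz is about 46.4 ms:

```python
# Consistency check of the STFT parameters in step S1.
frame_len = 1024           # samples per frame (~46.5 ms at the implied rate)
overlap = 0.75             # 75% frame-to-frame coverage
hop = int(frame_len * (1 - overlap))
n_fft = 2048               # 2048-point Fourier transform
n_bins = n_fft // 2 + 1    # one-sided spectrum dimensionality
print(hop, int(frame_len * overlap), n_bins)  # 256 768 1025
```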
Moreover, the residual DNN structure is a 4-layer DNN in which each layer takes input both from the preceding layer and, via a shortcut, from an earlier layer; each layer is 1024-dimensional, and the residual is expressed as follows:
F(x) = H(x) + x
where x is the input vector, H(x) is the output of the deeper layers, and F(x) is the output after the shortcut transform across layers.
To verify the effectiveness of the method, it was compared with the DNN-based speech enhancement method proposed by Xun et al., the CNN-based speech enhancement model proposed by Tomas et al., the LSTM-based speech enhancement model proposed by Chen et al., and the GRU-based speech enhancement model proposed by Bai et al.
The experimental evaluation indexes are Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI), and Mean Opinion Score (MOS).
Tables 2 and 3 show the parameters of the first-order network and the second-order network, respectively, and table 1 shows the experimental verification results.
Table 1 experimental verification results
TABLE 2 first order network parameters
TABLE 3 second-order network parameters
As the results show, the proposed method obtains better results than the other methods, and the results of subjective human evaluation are clearly better than those of machine perception. This demonstrates the effectiveness of the stepped approach: speech suited to machine perception and speech suited to human ear perception can both be obtained.
Although the embodiments of the present invention and the accompanying drawings are disclosed for illustrative purposes, those skilled in the art will appreciate that: various substitutions, changes and modifications are possible without departing from the spirit and scope of the invention and the appended claims, and therefore the scope of the invention is not limited to the disclosure of the embodiments and the accompanying drawings.
Claims (3)
1. A stepped voice enhancement method based on deep learning is characterized in that: the method comprises the following steps:
a) establishing the first-order framework fitted to machine perception:
step S1, framing and windowing the speech signal, then applying a short-time Fourier transform to convert the time-domain signal into a frequency-domain signal;
step S2, designing a heuristic input mode that concentrates the input in the low-frequency region to enhance the perceptual intensity there, expressed by the following formula:
Y = (Yi × signal(Yi, …, Yn), Yi, …, Yn)
where Yi denotes a single-frame feature frame in the original speech signal and signal(·) denotes the signal activation;
step S3, feeding the activated features into the first-order network, whose structure consists of a residual DNN with ReLU as every activation function, to perform noise separation in the high-dimensional space;
step S4, mapping the result of step S3 into 1025-dimensional enhanced linear features, then combining them with the phase of the original noisy speech to obtain a first-order-enhanced speech signal suited to machine perception;
b) establishing the second-order framework fitted to human auditory perception:
step S5, using the result of the first-order framework, further converting and enhancing the signal with a stacked residual DNN structure into a Mel-spectrum filtered signal designed around human ear perception;
step S6, feeding the output of the previous layer into two BGRU layers so that information across frames becomes correlated, eliminating inter-frame independence while coupling the frequency-domain information across time;
step S7, obtaining the (n, 80) Mel feature spectrum through a linear transformation, this being the second-order-enhanced feature spectrum, and synthesizing the speech signal suited to human auditory perception with a WaveRNN vocoder.
2. The deep learning based stepped speech enhancement method of claim 1, wherein: the Hamming window applied in step S1 has a window length of 46.5 ms, containing 1024 samples; adjacent frames overlap by 75%, or 768 samples; and a 2048-point Fourier transform maps the time-series speech information to 1025 dimensions.
3. The deep learning based stepped speech enhancement method of claim 1, wherein: the residual DNN structure is a 4-layer DNN in which each layer takes input both from the preceding layer and, via a shortcut, from an earlier layer; each layer is 1024-dimensional, and the residual is expressed as follows:
F(x) = H(x) + x
where x is the input vector, H(x) is the output of the deeper layers, and F(x) is the output after the shortcut transform across layers.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011359400.3A CN112562710B (en) | 2020-11-27 | 2020-11-27 | Stepped voice enhancement method based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112562710A true CN112562710A (en) | 2021-03-26 |
CN112562710B CN112562710B (en) | 2022-09-30 |
Family
ID=75046395
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011359400.3A Active CN112562710B (en) | 2020-11-27 | 2020-11-27 | Stepped voice enhancement method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112562710B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1737906A (en) * | 2004-03-23 | 2006-02-22 | 哈曼贝克自动系统-威美科公司 | Isolating speech signals utilizing neural networks |
CN110648684A (en) * | 2019-07-02 | 2020-01-03 | 中国人民解放军陆军工程大学 | Bone conduction voice enhancement waveform generation method based on WaveNet |
US20200312343A1 (en) * | 2019-04-01 | 2020-10-01 | Qnap Systems, Inc. | Speech enhancement method and system |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1737906A (en) * | 2004-03-23 | 2006-02-22 | 哈曼贝克自动系统-威美科公司 | Isolating speech signals utilizing neural networks |
US20200312343A1 (en) * | 2019-04-01 | 2020-10-01 | Qnap Systems, Inc. | Speech enhancement method and system |
CN110648684A (en) * | 2019-07-02 | 2020-01-03 | 中国人民解放军陆军工程大学 | Bone conduction voice enhancement waveform generation method based on WaveNet |
Non-Patent Citations (1)
Title |
---|
RANYA ALOUFI ET AL.: "Privacy-preserving Voice Analysis via Disentangled Representations", arXiv:2007.15064v2 *
Also Published As
Publication number | Publication date |
---|---|
CN112562710B (en) | 2022-09-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110085245B (en) | Voice definition enhancing method based on acoustic feature conversion | |
Latif et al. | Adversarial machine learning and speech emotion recognition: Utilizing generative adversarial networks for robustness | |
CN111916101B (en) | Deep learning noise reduction method and system fusing bone vibration sensor and double-microphone signals | |
Xiang et al. | A nested u-net with self-attention and dense connectivity for monaural speech enhancement | |
CN105206271A (en) | Intelligent equipment voice wake-up method and system for realizing method | |
CN109215665A (en) | A kind of method for recognizing sound-groove based on 3D convolutional neural networks | |
CN109215674A (en) | Real-time voice Enhancement Method | |
CN105741849A (en) | Voice enhancement method for fusing phase estimation and human ear hearing characteristics in digital hearing aid | |
CN105448302B (en) | A kind of the speech reverberation removing method and system of environment self-adaption | |
CN108564965B (en) | Anti-noise voice recognition system | |
CN103761974B (en) | Cochlear implant | |
CN115602165B (en) | Digital employee intelligent system based on financial system | |
CN106024010A (en) | Speech signal dynamic characteristic extraction method based on formant curves | |
CN110136709A (en) | Audio recognition method and video conferencing system based on speech recognition | |
CN107274887A (en) | Speaker's Further Feature Extraction method based on fusion feature MGFCC | |
Yuliani et al. | Speech enhancement using deep learning methods: A review | |
CN104778948B (en) | A kind of anti-noise audio recognition method based on bending cepstrum feature | |
CN109599094A (en) | The method of sound beauty and emotion modification | |
CN113035203A (en) | Control method for dynamically changing voice response style | |
CN112562710B (en) | Stepped voice enhancement method based on deep learning | |
WO2012159370A1 (en) | Voice enhancement method and device | |
Li et al. | A mapping model of spectral tilt in normal-to-Lombard speech conversion for intelligibility enhancement | |
CN114550701A (en) | Deep neural network-based Chinese electronic larynx voice conversion device and method | |
CN112614502B (en) | Echo cancellation method based on double LSTM neural network | |
Kashani et al. | Speech Enhancement via Deep Spectrum Image Translation Network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||