CN110619886B - End-to-end voice enhancement method for low-resource Tujia language - Google Patents

End-to-end voice enhancement method for low-resource Tujia language

Info

Publication number
CN110619886B
CN110619886B (application CN201910966022.6A)
Authority
CN
China
Prior art keywords
network
language
tujia
corpus
voice
Prior art date
Legal status
Active
Application number
CN201910966022.6A
Other languages
Chinese (zh)
Other versions
CN110619886A (en)
Inventor
于重重
康萌
陈运兵
徐世璇
Current Assignee
Beijing Technology and Business University
Original Assignee
Beijing Technology and Business University
Priority date
Filing date
Publication date
Application filed by Beijing Technology and Business University filed Critical Beijing Technology and Business University
Priority to CN201910966022.6A priority Critical patent/CN110619886B/en
Publication of CN110619886A publication Critical patent/CN110619886A/en
Application granted granted Critical
Publication of CN110619886B publication Critical patent/CN110619886B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 — Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 — Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 — Noise filtering
    • G10L21/0216 — Noise filtering characterised by the method used for estimating noise
    • G10L25/30 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an end-to-end speech enhancement method for the low-resource Tujia language, belongs to the field of speech signal processing, and relates to speech enhancement technology for low-resource languages; addressing the diversity, randomness and non-stationarity of the environmental noise in Tujia data, it realizes fast end-to-end speech enhancement. The method comprises the following steps: establishing an end-to-end speech enhancement model for the low-resource Tujia language based on a deep convolutional generative adversarial network and performing fast enhancement processing, thereby realizing fast end-to-end enhancement of Tujia speech and effectively removing the environmental noise from the speech with almost no distortion.

Description

End-to-end voice enhancement method for low-resource Tujia language
Technical Field
The invention belongs to the field of speech signal processing, relates to speech enhancement technology for low-resource languages, and particularly relates to an end-to-end speech enhancement method for the low-resource Tujia language based on a deep convolutional generative adversarial network.
Background
Speech enhancement is a preprocessing stage of digital speech processing whose main task is to extract the clean original speech signal from a noisy signal as faithfully as possible. It serves two purposes: first, suppressing background noise, improving speech quality and relieving listening fatigue, which is a subjective measure; second, improving speech intelligibility, which is an objective measure. Speech recognition has now reached a practical stage, but many recognition systems place high demands on the acoustic environment, and in real applications environmental noise degrades the performance of speech processing systems. Speech enhancement can therefore effectively counter noise pollution and improve the accuracy of speech recognition systems. Speech enhancement systems are now widely used in voice communication, multimedia technology and other fields.
Traditional speech enhancement algorithms include spectral subtraction, which has a small computational load and allows simple control of speech distortion and residual noise but tends to leave musical noise, and adaptive filtering such as Wiener filtering and Kalman filtering, which requires knowledge of certain characteristics or statistical properties of the noise. Time-domain subspace decomposition can also be used for speech enhancement, but it works better at low signal-to-noise ratios or with white noise. With the rapid development of deep learning, speech enhancement methods based on deep neural networks have attracted wide attention and show clear advantages over traditional methods on non-stationary noise; however, most deep network models are trained with supervision and depend on large amounts of labeled data and long training times.
Tujia is the language passed down through generations of the Tujia people in China and carries rich ethnic and cultural connotations. Because the number of Tujia speakers is falling sharply, oral transmission is being interrupted, and the language has no written form, it is on the verge of extinction. In addition, the range in which Tujia is used is very limited, and the areas where it is best preserved lie in remote, poorly accessible mountain valleys. Under these conditions the amount of data that can be collected is very limited and professional recording studios are hard to find; fieldwork and data collection for Tujia take place in natural environments, so it is hard to avoid noise in the audio files. Noise such as animal calls, motor vehicle sounds, electrical hum from the recording equipment and several people speaking at once buries the useful speech information, which further affects the subsequent tasks of Tujia annotation and speech recognition.
To ensure that high-quality corpora are obtained, removing the noise from Tujia speech data is a challenging problem. With existing speech denoising methods, Tujia annotation and speech recognition remain difficult to realize, and the accuracy of Tujia speech recognition is low.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides an end-to-end speech enhancement method for the low-resource Tujia language based on a deep convolutional generative adversarial network, which addresses the diversity, randomness and non-stationarity of the environmental noise in Tujia data and realizes fast end-to-end speech enhancement.
The method can lay a research foundation for a digital language resource library, improve the accuracy of subsequent speech recognition, help phoneticians record and preserve endangered languages, and present the appearance of the language and its cultural connotations more intuitively and vividly; it is therefore of practical significance for protecting and passing on language culture.
The technical scheme provided by the invention is as follows:
An end-to-end speech enhancement method for the low-resource Tujia language, in which an end-to-end speech enhancement model for the low-resource Tujia language is established based on a deep convolutional generative adversarial network, realizing fast end-to-end enhancement of Tujia speech and effectively removing the environmental noise from the speech; the method comprises the following steps:
1) constructing a Tujia language corpus: classifying and segmenting the recorded Tujia speech data to obtain the original noisy Tujia corpus and the original clean Tujia corpus, and cutting pure noise segments out of the original noisy Tujia corpus:
11) First, the Tujia corpus is divided into two parts according to the quality of the recorded speech data: noise-free data (the original clean Tujia corpus) and noisy data (the original noisy Tujia corpus). In the noisy data, the non-speech segments between sentences also contain environmental noise, so the noise segments can be cut out with a speech processing tool (e.g. ELAN software) to obtain pure noise segments. Specifically, the original noisy Tujia corpus contains noise both where there is speech and where there is not; the segments without speech are cut out as pure Tujia noise segments.
12) Both the noisy and the noise-free Tujia data consist of long narrations (long spoken sentences), which need to be segmented with a speech processing tool (e.g. a script for the cross-platform, multi-purpose phonetics software Praat); the independent short sentences obtained after segmentation are again divided into two classes: original noisy Tujia corpus and original clean Tujia corpus. (An illustrative scripting sketch of this cutting and segmentation step is given below.)
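As a purely illustrative sketch (not part of the claimed method), both the noise-segment cutting in step 11) and the short-sentence segmentation in step 12) can be scripted once the intervals have been marked, e.g. in an ELAN tier or a Praat TextGrid. The sketch below assumes WAV recordings and the Python soundfile package; the file names and time stamps are hypothetical.

```python
import soundfile as sf  # any WAV I/O library would do

def cut_segments(wav_path, intervals, out_prefix):
    """Cut [start, end] intervals (in seconds, e.g. exported from ELAN or a
    Praat TextGrid) out of a long recording and save each as its own WAV."""
    audio, sr = sf.read(wav_path)
    for k, (start, end) in enumerate(intervals):
        segment = audio[int(start * sr):int(end * sr)]
        sf.write(f"{out_prefix}_{k:03d}.wav", segment, sr)

# hypothetical example: two non-speech intervals marked as pure noise
cut_segments("tujia_story_01.wav", [(12.4, 14.1), (37.0, 39.5)], "noise/tujia_noise")
# hypothetical example: sentence boundaries exported from a Praat script
cut_segments("tujia_story_01.wav", [(0.0, 4.8), (5.1, 9.6)], "clean/tujia_sent")
```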
2) Expanding a corpus:
Because the amount of Tujia speech data is limited, a Chinese speech data set (e.g. the 30-hour thchs30 Chinese speech data set of Tsinghua University) is used as extension data for the Tujia language and is called the original clean Chinese corpus, which alleviates the shortage of Tujia speech data. The pure noise segments cut out in step 11) are added to the original clean Tujia corpus and the original clean Chinese corpus respectively; the resulting new corpora are called the synthesized noisy Tujia corpus and the synthesized noisy Chinese corpus respectively.
3) Establishing an end-to-end voice enhancement model:
The method establishes an end-to-end Tujia speech enhancement model with a deep convolutional generative adversarial network (DCGAN) and performs speech enhancement on the Tujia corpus;
The end-to-end Tujia speech enhancement model comprises a generator network and a discriminator network. The generator adopts an encoder-decoder, end-to-end, fully convolutional structure. With an adversarial training setup, the enhanced speech and the real clean speech are fed into the discriminator for classification; the discriminator judges as well as it can whether its input is real and passes this judgment back to the generator, so that the enhancement model gradually adjusts its output waveform toward the real distribution, until the discriminator can hardly distinguish real from generated input, which achieves the goal of removing the noise signal. The invention adds spectral normalization (SN) to every convolutional layer of the network and constrains the Lipschitz constant of the network by bounding the spectral norm of each layer. During model training, unbalanced learning rates are used, i.e. different learning rates and update rates are set for the generator and the discriminator, which makes training more stable.
The following operations are specifically executed:
31) First, the time-domain waveform of the synthesized noisy Chinese corpus is used as the input of the generator. The waveform is framed with overlapping sliding windows; in the concrete implementation the time window is 1 second with a 500 millisecond overlap between frames (a sketch of this framing is given after step 32)). The frames then pass through the 11 convolutional layers of the generator's encoding stage to obtain a compressed vector, which enters the decoding stage. The decoding stage mirrors the encoding stage: it has 11 deconvolution layers whose convolution-kernel parameters are consistent with those of the corresponding convolutional layers of the encoding stage. Each deconvolution layer receives both the output of the previous deconvolution layer and the output of the symmetric convolutional layer of the encoding stage; the two results are combined by weighted addition and passed to the next deconvolution layer, finally yielding the enhanced clean Chinese corpus;
32) The discriminator receives the original clean Chinese corpus and the enhanced clean Chinese corpus obtained in step 31), classifies them through its multi-layer convolutions to obtain a discrimination result (output 0 or 1), and passes the result to the generator; the generator computes its loss value from the loss function, back-propagates it to update the weights of every layer, and performs a new round of enhancement training on the noisy corpus, while the discriminator keeps receiving the generator's enhanced output and computing its own loss value from its loss function. This iteration is repeated until the discriminator can no longer distinguish the input sources (at which point its output settles at 0.5), giving the end-to-end speech enhancement model;
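A minimal sketch of the overlapping-window framing described in step 31), assuming 16 kHz audio held in a NumPy array; the function name and the zero-padding strategy are illustrative assumptions, not part of the claimed method.

```python
import numpy as np

def frame_waveform(audio, sr=16000, win_s=1.0, overlap_s=0.5):
    """Split a 1-D waveform into overlapping frames: 1-second windows with a
    500 ms overlap, zero-padding the final frame if necessary."""
    win = int(win_s * sr)
    hop = win - int(overlap_s * sr)
    n_frames = max(1, int(np.ceil((len(audio) - win) / hop)) + 1)
    padded = np.pad(audio, (0, (n_frames - 1) * hop + win - len(audio)))
    return np.stack([padded[i * hop:i * hop + win] for i in range(n_frames)])

frames = frame_waveform(np.random.randn(16000 * 3))  # 3 s of dummy audio
print(frames.shape)  # (5, 16000): one 1-second frame every 0.5 s
```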
4) The speech enhancement model obtained in step 3) is fine-tuned and trained further to obtain the end-to-end Tujia speech enhancement model, called fine-tuned DCGAN (FDCGAN). The specific operation is as follows: the original clean Tujia corpus obtained in step 1) and the synthesized noisy Tujia corpus obtained in step 2) are fed as training data into the end-to-end speech enhancement model obtained in step 3), the learning rates and batch-size parameters of the model are modified, and training continues, finally giving the trained end-to-end Tujia speech enhancement model FDCGAN;
5) The Tujia data to be enhanced are fed into the trained end-to-end Tujia speech enhancement model FDCGAN obtained in step 4), and the enhanced Tujia speech is output.
In a concrete implementation, the original noisy Tujia corpus from step 1) is used as test data for the Tujia speech enhancement model obtained in step 4), and a speech quality evaluation tool is used to verify and evaluate the Tujia speech enhancement model provided by the invention.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides a voice enhancement model based on an improved deep convolution countermeasure generation network aiming at the diversity, randomness and non-stationarity of environmental noise in Tujia voice data, which can carry out rapid enhancement processing and realize end-to-end enhancement processing on Tujia voice frequency files. Because the Tujia language has low resource and very limited data volume, the invention adopts the Chinese speech data set as the extension, so that the model generalization is stronger. Compared with the prior art, the method adds Spectral Normalization (SN) into each convolution layer, and constrains the Lipschitz constant of the network by limiting the Spectral norm of each layer. During model training, the model training can be more stable by adopting the unbalanced learning rate, namely, the learning rate and the different updating rates are respectively set for the generation network and the judgment network. Compared with the existing mainstream speech enhancement method, the result shows that the environmental noise in the Tujia speech can be effectively removed under the condition of almost no distortion.
Drawings
FIG. 1 is a block flow diagram of a specific embodiment of the process of the present invention.
FIG. 2 is a schematic diagram of an end-to-end speech enhancement model employed in an embodiment of the present invention.
Fig. 3 is a schematic diagram of the change of the loss function value during generator network training in the embodiment of the present invention.
Fig. 4 is a schematic diagram of the change of the loss function value during discriminator network training in the embodiment of the present invention.
In fig. 3 to 4, the number of iterations (passes) is plotted on the abscissa, and the loss function value (loss) is plotted on the ordinate.
FIG. 5 is a spectrogram of Tujia speech before enhancement according to an embodiment of the present invention.
Fig. 6 is a spectrogram of Tujia speech after enhancement in the embodiment of the present invention.
In fig. 5 to 6, the abscissa is Time (Time) and the ordinate is frequency (Hz).
Detailed Description
The invention is further illustrated by the following examples in connection with the accompanying drawings, without in any way limiting the scope of the invention.
The following embodiment describes the implementation of the speech enhancement method provided by the invention in detail. The Tujia data comprise 27 short spoken narration corpora with a total duration of 7 hours, 8 minutes and 59 seconds; the thchs30 Chinese corpus, recorded by 25 speakers, has a total duration of more than 30 hours.
The flow diagram of the concrete implementation of the method is shown in fig. 1. The invention provides an end-to-end speech enhancement method for the low-resource Tujia language based on a deep convolutional generative adversarial network. Because Tujia is a low-resource language, the experimental data are expanded to build the database; a deep convolutional network is combined with adversarial training to enhance the speech signal, and fine-tuning retraining on Tujia data makes the final model generalize better and enhance better. The model takes the raw speech signal directly as input and outputs the enhanced speech signal; this end-to-end approach preserves the phase details in the time domain of the original speech signal. In the deep convolutional generative adversarial network, every convolutional layer uses spectral normalization, and the training cost is reduced by modifying the loss function and the network layer parameters. Unbalanced learning rates are used when training the generator and the discriminator, which makes the training of both networks more stable. The concrete implementation steps are as follows:
data preprocessing and database construction:
1) The Tujia data set is divided into two parts, noisy data and noise-free data. Using ELAN software (an annotation tool for creating, editing, visualizing and searching annotations on video and audio data) and Praat scripts (a cross-platform, multi-purpose phonetics software), the speech data are segmented into short sentences, called the original noisy Tujia corpus and the original clean Tujia corpus, and the noise segments in the noisy data are cut out manually. The noise types include rooster crows, chick calls, motor vehicle sounds, interference noise from electronic equipment and other noise; the quantities are listed in Table 1:
TABLE 1 Tujia language noise types and numbers
[Table 1 is reproduced as an image in the original document.]
2) The noise segments are superimposed onto the original clean Tujia corpus and the original clean Chinese corpus with the audio conversion and processing tool SoX. For the superposition, a starting position is chosen at random on the sampling points, and the different noises are injected into each speaker's recordings of the original clean Chinese corpus in proportion to the number of segments of each noise type relative to the total number of noise segments, as shown in formula (1):
m_ij = (N_i / Σ_k N_k) × M_j    (1)
wherein N_i represents the number of segments of noise type i, M_j represents the number of recordings of the j-th speaker in the original clean Chinese corpus, and m_ij represents the number of segments of noise i injected into the recordings of the j-th speaker in the thchs30 corpus; in the concrete implementation i = 1, …, 5 and j = 1, …, 25. Noise is injected into the original clean Tujia corpus in the same way. The resulting new corpora are called the synthesized noisy Tujia corpus and the synthesized noisy Chinese corpus.
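A minimal sketch of this allocation-and-mixing step, assuming the per-type noise counts and the recordings are already loaded as Python lists/NumPy arrays; SoX is replaced here by plain NumPy addition at a random offset, and all names and counts are illustrative.

```python
import numpy as np

def allocate_noise(noise_counts, num_recordings_j):
    """Formula (1): number of segments of each noise type i injected into
    speaker j, proportional to that type's share of all noise segments."""
    total = sum(noise_counts)
    return [round(n_i / total * num_recordings_j) for n_i in noise_counts]

def mix_at_random_offset(clean, noise, rng=np.random.default_rng(0)):
    """Add one noise segment to a clean recording starting at a random sample."""
    start = int(rng.integers(0, max(1, len(clean) - len(noise))))
    noisy = clean.copy()
    noisy[start:start + len(noise)] += noise[:len(clean) - start]
    return noisy

# hypothetical counts for the 5 noise types and a speaker with 250 recordings
print(allocate_noise([40, 25, 60, 15, 10], 250))  # -> [67, 42, 100, 25, 17]
```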
Speech enhancement model training process; the end-to-end speech enhancement model is shown in fig. 2:
1) The speech waveform of the synthesized noisy Chinese corpus (denoted z) is fed into the generator. The encoder of the generator consists of 11 one-dimensional strided convolutional layers with kernel width 31 and stride 2, with 16, 32, 64, 128, 256, 512 and 1024 filters across the layers; the decoder mirrors the encoder and likewise contains 11 deconvolution layers with the same parameters. The arrows in fig. 2 indicate skip connections: the feature map of each convolutional layer is passed to the corresponding deconvolution layer, where it is combined with the output of the previous deconvolution layer by weighted addition and passed on to the next deconvolution layer, which avoids the loss of detail. The activation function of every convolutional layer is the PReLU function. The generator output, the enhanced Chinese speech waveform, is denoted G(z).
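The following PyTorch sketch is a purely illustrative rendering of a generator with this shape: 11 strided 1-D convolutions (kernel width 31, stride 2) with spectral normalization and PReLU activations, mirrored by 11 transposed convolutions with skip connections combined by weighted addition. The exact channel progression and the output activation are assumptions, not the patented implementation.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class Generator(nn.Module):
    """Encoder-decoder fully convolutional generator with skip connections."""
    def __init__(self, channels=(16, 32, 32, 64, 64, 128, 128, 256, 256, 512, 1024)):
        super().__init__()
        enc, dec, in_ch = [], [], 1
        for out_ch in channels:                       # 11 encoder layers
            enc.append(nn.Sequential(
                spectral_norm(nn.Conv1d(in_ch, out_ch, 31, stride=2, padding=15)),
                nn.PReLU()))
            in_ch = out_ch
        for i in reversed(range(len(channels))):      # 11 mirrored decoder layers
            out_ch = channels[i - 1] if i > 0 else 1
            dec.append(nn.Sequential(
                spectral_norm(nn.ConvTranspose1d(channels[i], out_ch, 31, stride=2,
                                                 padding=15, output_padding=1)),
                nn.PReLU() if i > 0 else nn.Tanh()))  # output activation is assumed
        self.enc, self.dec = nn.ModuleList(enc), nn.ModuleList(dec)

    def forward(self, x):                             # x: (batch, 1, 16384)
        skips = []
        for layer in self.enc:
            x = layer(x)
            skips.append(x)
        for layer, skip in zip(self.dec, reversed(skips)):
            x = layer(0.5 * (x + skip))               # weighted addition of the skip path
        return x

gen = Generator()
print(gen(torch.randn(2, 1, 16384)).shape)  # torch.Size([2, 1, 16384])
```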
2) The enhanced Chinese speech G(z) and the original clean Chinese speech are fed into the discriminator. The discriminator is a one-dimensional binary-classification convolutional network with two channels for acquiring the input sources, each channel holding 16384 sampling points; the last layer is a 1 × 1 convolution, and every layer uses a LeakyReLU nonlinear activation function with an α value of 0.3. The discriminator loss function L_D is expressed as formula (2):
[Formula (2) is reproduced as an image in the original document.]
wherein x represents clean speech, P_data represents the distribution obeyed by the clean speech x, z is the noisy speech, and P_z represents the distribution obeyed by the noisy speech z. If the input is G(z), the discriminator output D(G(z)) is expected to be 0; if the input is x, the output D(x) is expected to be 1.
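A purely illustrative PyTorch sketch of a discriminator of this kind: a 1-D convolutional binary classifier with two input channels of 16384 samples, spectral normalization, LeakyReLU(0.3) activations and a final 1 × 1 convolution. The number and width of the intermediate layers are assumptions.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class Discriminator(nn.Module):
    """1-D binary-classification conv net over a two-channel speech pair."""
    def __init__(self, channels=(32, 64, 128, 256, 512)):
        super().__init__()
        layers, in_ch = [], 2                              # two input channels
        for out_ch in channels:
            layers += [spectral_norm(nn.Conv1d(in_ch, out_ch, 31, stride=2, padding=15)),
                       nn.LeakyReLU(0.3)]
            in_ch = out_ch
        layers += [nn.Conv1d(in_ch, 1, kernel_size=1)]     # final 1x1 convolution
        self.net = nn.Sequential(*layers)

    def forward(self, pair):                               # pair: (batch, 2, 16384)
        return torch.sigmoid(self.net(pair)).mean(dim=(1, 2))  # score in [0, 1]

disc = Discriminator()
print(disc(torch.randn(4, 2, 16384)).shape)  # torch.Size([4])
```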
3) The discriminator passes the discrimination result back to the generator, and the generator calculates its loss function L_G according to formula (3):
[Formula (3) is reproduced as an image in the original document.]
Both networks back-propagate according to their loss values to update the weights of each layer until D(G(z)) = D(x) = 0.5, i.e. the discriminator can no longer identify whether the input signal is the original clean speech signal or the generator-enhanced clean speech, at which point training ends.
4) The generator and the discriminator are set to update at a rate of 1 to 1, and each network is trained while the other remains frozen. The generator learning rate is 0.0001, the discriminator learning rate is 0.0003, and the batch size is 24. During model training the invention uses spectral normalization and unbalanced learning rates to make the training process more stable. Spectral normalization bounds the spectral norm of each layer to constrain the Lipschitz constant of the discriminator; bounding the gradient of the Lipschitz-continuous function makes the function smoother, so the parameters change more steadily during optimization of the neural network and gradient explosion is less likely. The parameters of the generator and the discriminator are updated with learning rates a(n) and b(n), as expressed in formula (4) and formula (5):
θ_{n+1} = θ_n + a(n)·(h(θ_n, ω_n) + M_n^(θ))    (4)
ω_{n+1} = ω_n + b(n)·(g(θ_n, ω_n) + M_n^(ω))    (5)
wherein θ_n, h(θ_n, ω_n) and M_n^(θ) are respectively the parameter vector, the stochastic descent gradient and the random vector of the generator at the n-th update, and ω_n, g(θ_n, ω_n) and M_n^(ω) are respectively the parameter vector, the stochastic descent gradient and the random vector of the discriminator at the n-th update.
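A purely illustrative training-step sketch that ties the two sketches above together: the discriminator is pushed toward D(x) = 1 and D(G(z)) = 0 and the generator toward D(G(z)) = 1, with the unbalanced learning rates of the embodiment set when the optimizers are created. The binary cross-entropy form of the losses and the pairing of each candidate with the noisy input on the two discriminator channels are assumptions, since formulas (2) and (3) appear only as images in the original document.

```python
import torch
import torch.nn.functional as F

def train_step(gen, disc, noisy, clean, opt_g, opt_d):
    """One adversarial update; only one network's optimizer steps at a time."""
    # --- discriminator update (generator frozen) ---
    with torch.no_grad():
        enhanced = gen(noisy)
    d_real = disc(torch.cat([clean, noisy], dim=1))     # channel 1: candidate, channel 2: noisy
    d_fake = disc(torch.cat([enhanced, noisy], dim=1))
    loss_d = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # --- generator update (discriminator weights not updated) ---
    enhanced = gen(noisy)
    d_fake = disc(torch.cat([enhanced, noisy], dim=1))
    loss_g = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_g.item(), loss_d.item()

# unbalanced (two-time-scale) learning rates from the embodiment, batch size 24
opt_g = torch.optim.Adam(gen.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=3e-4)
```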
5) After the speech enhancement model has been trained with the Chinese corpus, the synthesized noisy Tujia corpus and the original clean Tujia corpus are used to train the model again, with all other parameters unchanged, the generator learning rate set to 0.00006, the discriminator learning rate set to 0.0001 and the batch size set to 16, so that the model generalizes better. The changes of the loss functions of the generator and the discriminator during training are shown in fig. 3 and fig. 4.
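A minimal sketch of this fine-tuning stage, under the same assumptions as the sketches above: the weights trained on the Chinese corpus are kept, and only the optimizers and batch size are re-created with the smaller learning rates. The checkpoint name and the make_loader helper are hypothetical.

```python
import torch

# load the model trained on the Chinese corpus (checkpoint name is hypothetical)
state = torch.load("dcgan_chinese_pretrained.pt")
gen.load_state_dict(state["generator"])
disc.load_state_dict(state["discriminator"])

# fine-tuning learning rates and batch size from the embodiment
opt_g = torch.optim.Adam(gen.parameters(), lr=6e-5)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
loader = make_loader(tujia_noisy, tujia_clean, batch_size=16)  # hypothetical data loader

for noisy, clean in loader:
    train_step(gen, disc, noisy, clean, opt_g, opt_d)  # converged model = FDCGAN
```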
6) Finally, the model is tested with the original noisy Tujia corpus; the spectrogram of the Tujia speech before enhancement is shown in fig. 5 and the spectrogram after enhancement in fig. 6.
The enhancement method for Tujia speech data is compared with a conventional speech enhancement method and a speech enhancement method based on a deep recurrent neural network. The evaluation indices are the Perceptual Evaluation of Speech Quality (PESQ) and the Mean Opinion Score - Listening Quality Objective (MOS-LQO). PESQ is a representative algorithm for speech quality evaluation; it uses a linear scoring scale, is widely used, and its values range from -0.5 to 4.5. It compares the input test speech with the output speech, and a higher score indicates better speech quality. The evaluation results are shown in Table 2:
TABLE 2 comparison of results of different enhancement methods
[Table 2 is reproduced as an image in the original document.]
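As a purely illustrative sketch (not part of the patent), a PESQ score of this kind can be computed with the open-source pesq package, assuming 16 kHz recordings; the file names are hypothetical.

```python
import soundfile as sf
from pesq import pesq  # pip install pesq

clean, sr = sf.read("tujia_clean_reference.wav")    # reference signal
enhanced, _ = sf.read("tujia_enhanced_output.wav")  # signal to be scored

# wide-band PESQ at 16 kHz; higher is better (roughly -0.5 to 4.5)
score = pesq(sr, clean, enhanced, "wb")
print(f"PESQ (wb): {score:.2f}")
```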
The results in Table 2 show that the end-to-end speech enhancement method based on a deep convolutional generative adversarial network provided by the invention can effectively remove the environmental noise in Tujia speech, has a better enhancement effect, and lays a solid foundation for speech recognition.
It is noted that the disclosed embodiments are intended to aid in further understanding of the invention, but those skilled in the art will appreciate that: various substitutions and modifications are possible without departing from the spirit and scope of the invention and appended claims. Therefore, the invention should not be limited to the embodiments disclosed, but the scope of the invention is defined by the appended claims.

Claims (9)

1. An end-to-end speech enhancement method for the low-resource Tujia language, characterized in that an end-to-end speech enhancement model for the low-resource Tujia language is established based on a deep convolutional generative adversarial network, realizing fast end-to-end enhancement of Tujia speech and effectively removing the environmental noise from the speech; the method comprises the following steps:
1) constructing a Tujia language corpus: classifying and segmenting the recorded Tujia speech data to obtain the original noisy Tujia corpus and the original clean Tujia corpus, and cutting pure noise segments out of the original noisy Tujia corpus;
2) expanding the corpus: using original clean Chinese corpus as extension data for the Tujia language, adding the pure noise segments to the original clean Tujia corpus and the original clean Chinese corpus respectively, and calling the resulting new corpora the synthesized noisy Tujia corpus and the synthesized noisy Chinese corpus respectively;
3) establishing and training an end-to-end Tujia speech enhancement model, comprising:
establishing the end-to-end Tujia speech enhancement model with a deep convolutional generative adversarial network DCGAN;
the end-to-end Tujia speech enhancement model comprises: a generator network and a discriminator network;
the generator network adopts an encoder-decoder, end-to-end, fully convolutional structure;
adding spectral normalization to every convolutional layer of the network, and constraining the Lipschitz constant of the network by bounding the spectral norm of each convolutional layer;
adopting an adversarial training setup, feeding the enhanced speech and the real clean speech into the discriminator network for classification, judging whether the input signal is real, and passing the judgment back to the generator network, so that the end-to-end Tujia speech enhancement model fine-tunes its output waveform toward the real distribution, thereby removing the noise signal;
training the end-to-end Tujia speech enhancement model, specifically comprising the following operations:
31) taking the time-domain waveform of the synthesized noisy Chinese corpus as the input of the generator network;
feeding the waveform, framed with overlapping sliding windows, into the plurality of convolutional layers of the generator's encoding stage to obtain a compressed vector;
the compressed vector enters the generator's decoding stage;
the generator's decoding stage mirrors the encoding stage, and the convolution-kernel parameters of its deconvolution layers are consistent with those of the corresponding convolutional layers of the encoding stage; each deconvolution layer receives both the output of the previous deconvolution layer and the output of the symmetric convolutional layer of the encoding stage, combines the two results by weighted addition and passes them to the next deconvolution layer, thereby obtaining the enhanced clean Chinese corpus;
32) the discriminator network receives the original clean Chinese corpus and the enhanced clean Chinese corpus obtained in step 31), and classifies them through its multi-layer convolutions to obtain a discrimination result;
transmitting the discrimination result to the generator network;
training the two networks in alternating cycles by computing their loss functions, until the discriminator network can no longer tell the input sources apart, thereby obtaining the end-to-end Tujia speech enhancement model;
4) fine-tuning and further training the end-to-end Tujia speech enhancement model obtained in step 3) to obtain the trained end-to-end Tujia speech enhancement model FDCGAN; the specific operation is as follows:
feeding the original clean Tujia corpus from step 1) and the synthesized noisy Tujia corpus obtained in step 2) as training data into the end-to-end speech enhancement model obtained in step 3), modifying the learning rates and batch-size parameters of the model, and training, finally obtaining the trained end-to-end Tujia speech enhancement model FDCGAN;
5) feeding the Tujia data to be enhanced into the trained end-to-end Tujia speech enhancement model FDCGAN obtained in step 4), and outputting the enhanced Tujia speech.
2. The method as claimed in claim 1, wherein the step 1) of constructing the Tujia language corpus comprises the following steps:
11) firstly, dividing the Tujia corpus into two parts according to the quality of the recorded Tujia speech data: noise-free data and noisy data, which are the original clean Tujia corpus and the original noisy Tujia corpus respectively;
then, a voice processing tool is used for cutting out noise segments in the noisy data to obtain pure noise segments;
12) segmenting the long speech sentences to obtain independent short sentences; the short sentences are again divided into two classes: original noisy Tujia corpus and original clean Tujia corpus.
3. The method as claimed in claim 1, wherein step 2) expands the corpus by specifically adopting the 30-hour Chinese speech data set thchs30 of Tsinghua University as the extension data for the Tujia language.
4. The method of claim 1, wherein the plurality of convolutional layers is 11 convolutional layers.
5. The method as claimed in claim 1, wherein in step 2) the pure noise segments are added to the original clean Tujia corpus and the original clean Chinese corpus respectively and superimposed by an audio conversion and processing tool, using the following method:
randomly selecting a starting position at a sampling point, and injecting the different noises into the recordings of each speaker in the original clean Chinese corpus in proportion to the number of segments of each noise type relative to the total number of noise segments, as expressed in formula (1):
m_ij = (N_i / Σ_k N_k) × M_j    (1)
wherein N_i represents the number of segments of noise type i, M_j represents the number of recordings of the j-th speaker in the original clean Chinese corpus, and m_ij represents the number of segments of noise i injected into the recordings of the j-th speaker in the thchs30 corpus.
6. The method as claimed in claim 1, wherein the activation function for each convolutional layer of the network in step 3) is a PReLU function.
7. The method as claimed in claim 1, wherein in step 3) the discriminator network is a one-dimensional binary-classification convolutional network with two channels for acquiring the input sources, each channel holding 16384 sampling points, and the last layer being a 1 × 1 convolutional layer;
each layer of the discriminator network uses a LeakyReLU nonlinear activation function;
the loss function L_D of the discriminator network is represented by formula (2):
[Formula (2) is reproduced as an image in the original document.]
wherein x represents clean speech, P_data represents the distribution obeyed by the clean speech x, z is the noisy speech, and P_z represents the distribution obeyed by the noisy speech z; if the input is G(z), the discriminator network output D(G(z)) is expected to be 0; if the input is x, the output D(x) is expected to be 1;
the discriminator network transmits the discrimination result to the generator network, and the generator network calculates its loss function L_G according to formula (3):
[Formula (3) is reproduced as an image in the original document.]
the two networks back-propagate the loss values to update the weights of each layer until the discriminator network cannot identify whether the input signal is the original clean speech signal or the generator-enhanced clean speech, at which point training ends.
8. The method of claim 7, wherein the generator network and the discriminator network are set to update at a rate of 1 to 1, and each network is trained while the other remains frozen.
9. The method of claim 8, wherein spectral normalization and unbalanced learning rates are used to stabilize the training process; the spectral normalization bounds the spectral norm of each layer and constrains the Lipschitz constant of the discriminator network;
the parameters of the generator network and the discriminator network are updated with learning rates a(n) and b(n), as expressed in formula (4) and formula (5):
θ_{n+1} = θ_n + a(n)·(h(θ_n, ω_n) + M_n^(θ))    (4)
ω_{n+1} = ω_n + b(n)·(g(θ_n, ω_n) + M_n^(ω))    (5)
wherein θ_n, h(θ_n, ω_n) and M_n^(θ) are respectively the parameter vector, the stochastic descent gradient and the random vector of the generator network at the n-th update, and ω_n, g(θ_n, ω_n) and M_n^(ω) are respectively the parameter vector, the stochastic descent gradient and the random vector of the discriminator network at the n-th update.
CN201910966022.6A 2019-10-11 2019-10-11 End-to-end voice enhancement method for low-resource Tujia language Active CN110619886B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910966022.6A CN110619886B (en) 2019-10-11 2019-10-11 End-to-end voice enhancement method for low-resource Tujia language

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910966022.6A CN110619886B (en) 2019-10-11 2019-10-11 End-to-end voice enhancement method for low-resource Tujia language

Publications (2)

Publication Number Publication Date
CN110619886A CN110619886A (en) 2019-12-27
CN110619886B true CN110619886B (en) 2022-03-22

Family

ID=68925699

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910966022.6A Active CN110619886B (en) 2019-10-11 2019-10-11 End-to-end voice enhancement method for low-resource Tujia language

Country Status (1)

Country Link
CN (1) CN110619886B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112002343B (en) * 2020-08-18 2024-01-23 海尔优家智能科技(北京)有限公司 Speech purity recognition method and device, storage medium and electronic device
CN112185417B (en) * 2020-10-21 2024-05-10 平安科技(深圳)有限公司 Method and device for detecting artificial synthesized voice, computer equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107293289A (en) * 2017-06-13 2017-10-24 南京医科大学 A kind of speech production method that confrontation network is generated based on depth convolution
CN109326302A (en) * 2018-11-14 2019-02-12 桂林电子科技大学 A kind of sound enhancement method comparing and generate confrontation network based on vocal print
CN109524020A (en) * 2018-11-20 2019-03-26 上海海事大学 A kind of speech enhan-cement processing method
CN110111803A (en) * 2019-05-09 2019-08-09 南京工程学院 Based on the transfer learning sound enhancement method from attention multicore Largest Mean difference

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017190219A1 (en) * 2016-05-06 2017-11-09 Eers Global Technologies Inc. Device and method for improving the quality of in- ear microphone signals in noisy environments
CN109147810B (en) * 2018-09-30 2019-11-26 百度在线网络技术(北京)有限公司 Establish the method, apparatus, equipment and computer storage medium of speech enhan-cement network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107293289A (en) * 2017-06-13 2017-10-24 南京医科大学 A kind of speech production method that confrontation network is generated based on depth convolution
CN109326302A (en) * 2018-11-14 2019-02-12 桂林电子科技大学 A kind of sound enhancement method comparing and generate confrontation network based on vocal print
CN109524020A (en) * 2018-11-20 2019-03-26 上海海事大学 A kind of speech enhan-cement processing method
CN110111803A (en) * 2019-05-09 2019-08-09 南京工程学院 Based on the transfer learning sound enhancement method from attention multicore Largest Mean difference

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Time-Frequency Mask-based Speech Enhancement using Convolutional Generative Adversarial Network;Neil Shah et al;《Proceedings, APSIPA Annual Summit and Conference 2018》;20181231;第1246-1251页 *
Improving the generalization ability of speech enhancement methods by using generated noise (利用生成噪声提高语音增强方法的泛化能力); Yuan Wenhao et al.; Acta Electronica Sinica (电子学报); 2019-04-15 (No. 04); pp. 791-797 *

Also Published As

Publication number Publication date
CN110619886A (en) 2019-12-27

Similar Documents

Publication Publication Date Title
CN108899051B (en) Speech emotion recognition model and recognition method based on joint feature representation
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
CN108717856B (en) Speech emotion recognition method based on multi-scale deep convolution cyclic neural network
CN110782872A (en) Language identification method and device based on deep convolutional recurrent neural network
CN108447495B (en) Deep learning voice enhancement method based on comprehensive feature set
CN103377651B (en) The automatic synthesizer of voice and method
CN105788592A (en) Audio classification method and apparatus thereof
CN106328123B (en) Method for recognizing middle ear voice in normal voice stream under condition of small database
CN105895082A (en) Acoustic model training method and device as well as speech recognition method and device
Rammo et al. Detecting the speaker language using CNN deep learning algorithm
CN110619886B (en) End-to-end voice enhancement method for low-resource Tujia language
CN113823323B (en) Audio processing method and device based on convolutional neural network and related equipment
CN111341294A (en) Method for converting text into voice with specified style
CN110992959A (en) Voice recognition method and system
Jie et al. Speech emotion recognition of teachers in classroom teaching
Lim et al. Harmonic and percussive source separation using a convolutional auto encoder
Cao et al. Speaker-independent speech emotion recognition based on random forest feature selection algorithm
CN113571095B (en) Speech emotion recognition method and system based on nested deep neural network
CN109461447B (en) End-to-end speaker segmentation method and system based on deep learning
Jin et al. Speech separation and emotion recognition for multi-speaker scenarios
CN114141237A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN111402919B (en) Method for identifying style of playing cavity based on multi-scale and multi-view
CN112116921A (en) Single sound track voice separation method based on integration optimizer
Südholt et al. Pruning deep neural network models of guitar distortion effects
CN116467416A (en) Multi-mode dialogue emotion recognition method and system based on graphic neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant