CN112927673A - Novel Uygur voice synthesis method - Google Patents

Novel Uygur voice synthesis method

Info

Publication number
CN112927673A
Authority
CN
China
Prior art keywords: uygur, text, novel, speech, language
Legal status
Withdrawn
Application number
CN202110180854.2A
Other languages
Chinese (zh)
Inventor
帕丽旦·木合塔尔
买买提阿依甫
Current Assignee
Xinjiang University Of Finance & Economics
Original Assignee
Xinjiang University Of Finance & Economics
Application filed by Xinjiang University Of Finance & Economics
Priority to CN202110180854.2A
Publication of CN112927673A
Withdrawn

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/027 - Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The invention discloses a novel Uygur speech synthesis method. End-to-end speech synthesis based on deep learning yields high naturalness, while the HMM-based method offers good system stability. The front end of the system therefore uses an HMM to obtain the inherent linguistic features of Uygur, and the back-end synthesis part uses a deep neural network framework to build an autoregressive speech synthesis model. The continuity and stability of the synthesized speech are clearly superior to those of the parametric synthesis method and the end-to-end synthesis method, and the naturalness reaches a satisfactory level.

Description

Novel Uygur voice synthesis method
Technical Field
The invention relates to the field of artificial intelligence, in particular to a novel Uygur voice synthesis method.
Background
An artificial neural network is a mathematical model built by simulating the biological nervous system. It reflects, to some extent, basic characteristics of the brain as a biological system and is a network structure modeled on biological neural processes. In 2012, Hinton et al. successfully applied deep learning to speech recognition and greatly improved the recognition rate, which gave rise to speech synthesis methods based on neural networks.
Among the neural-network-based speech synthesis methods, the most commonly used are the deep neural network (DNN) based method, the recurrent neural network (RNN) based method, and the long short-term memory (LSTM) network based method. The bidirectional RNN proposed by Schuster acts as a sequence learner that can encode the context of the current frame, generating a bidirectional sequence and modeling it through sequence learning. To address the vanishing-gradient problem of conventional RNNs, Hochreiter et al. proposed the long short-term memory (LSTM) structure. A more recent popular approach employs deep belief networks (DBNs) to model speech and acoustic features jointly. Acoustic features have also been predicted with real-valued neural autoregressive density estimators and deep mixture density networks. A deep feed-forward neural network can be viewed as a replacement for the decision trees in HMM-based speech synthesis. Recurrent neural networks express TTS as a sequence-to-sequence mapping problem. Under added context constraints, long short-term memory networks combine the RNN-based gated recurrent unit (GRU) with a mixture density model to predict the sequence of probability density functions. The technical problems reflected above are therefore problems to be solved by those skilled in the art.
Disclosure of Invention
The technical problem to be solved by the invention is to overcome the defects of the prior art and provide a novel Uygur speech synthesis method.
In order to solve the above technical problems, the invention provides a novel Uygur speech synthesis method comprising the following steps (a minimal sketch of the resulting structure follows the list):
(1) forming a recurrent neural network from two recurrent neural networks;
(2) encoding the source language and decoding into the target language with the recurrent networks, the encoder mapping a variable-length input sequence to a fixed-length vector and the decoder mapping that vector representation back to a variable-length target sequence;
(3) reading forward from the start of the text sequence with one RNN while another RNN model reads from the end of the sequence;
(4) extending the memory with a long short-term memory network, using LSTM units as the building blocks of the RNN layers.
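The patent fixes no software framework for these steps; the following minimal PyTorch sketch (the framework choice, layer sizes, and dummy dimensions are all assumptions) illustrates steps (1)-(4): one LSTM reads the input from the start and one from the end, the encoder packs the sequence into a fixed-length vector, and a decoder unrolls that vector into a variable-length output sequence.

# Minimal encoder-decoder sketch for steps (1)-(4). Dimensions are
# illustrative, not taken from the patent.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, in_dim=486, hid_dim=256):
        super().__init__()
        # bidirectional=True realizes steps (3)-(4): one LSTM reads from
        # the start of the sequence, the other from the end
        self.rnn = nn.LSTM(in_dim, hid_dim, batch_first=True,
                           bidirectional=True)

    def forward(self, x):                       # x: (batch, time, in_dim)
        _, (h, _) = self.rnn(x)
        # concatenate final forward and backward states into the
        # fixed-length vector of step (2)
        return torch.cat([h[0], h[1]], dim=-1)  # (batch, 2 * hid_dim)

class Decoder(nn.Module):
    def __init__(self, ctx_dim=512, out_dim=66, hid_dim=256):
        super().__init__()
        self.rnn = nn.LSTM(ctx_dim, hid_dim, batch_first=True)
        self.proj = nn.Linear(hid_dim, out_dim)

    def forward(self, context, steps):
        # repeat the fixed-length context vector at every output step,
        # then unroll it into a variable-length target sequence
        x = context.unsqueeze(1).repeat(1, steps, 1)
        y, _ = self.rnn(x)
        return self.proj(y)                     # (batch, steps, out_dim)

enc, dec = Encoder(), Decoder()
feats = torch.randn(2, 100, 486)                # dummy linguistic features
out = dec(enc(feats), steps=120)                # dummy acoustic trajectory
print(out.shape)                                # torch.Size([2, 120, 66])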
A novel Uygur voice synthesis system comprises a training module and a synthesis module;
the training module is used for constructing the language features, extracting the acoustic features, and sending the data to the synthesis module;
the synthesis module is used for receiving input data and the data sent by the training module and synthesizing the speech.
As an improvement, the training module comprises a database, a text processing module and a voice processing module.
As an improvement, the synthesis module comprises a regression model, a text input module and a synthesized voice module.
As an improvement, the construction of the language features comprises the following steps (a sketch of the normalization step follows the list):
A. performing front-end text processing and generating a corresponding label file;
B. encoding the label file and mapping each context label to a feature vector as the input linguistic feature vector of the DNN;
C. performing up-sampling processing to complete the construction of the language features;
D. normalizing the language features with min-max normalization.
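Step D is plain per-dimension min-max scaling. A minimal NumPy sketch, assuming a (0.01, 0.99) target range (the range is borrowed from common practice in tools such as Merlin; the patent specifies only that min-max normalization is used):

import numpy as np

def minmax_normalize(feats, lo=0.01, hi=0.99):
    # scale every feature dimension into [lo, hi] (step D)
    fmin = feats.min(axis=0)
    fmax = feats.max(axis=0)
    span = np.where(fmax > fmin, fmax - fmin, 1.0)  # guard constant dims
    return lo + (hi - lo) * (feats - fmin) / span

frames = np.random.rand(1000, 486)   # dummy up-sampled feature matrix
norm = minmax_normalize(frames)
print(norm.min(), norm.max())        # ~0.01 and ~0.99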
As an improvement, encoding the label file and mapping each context label to a feature vector comprises the following steps (a label-to-frame sketch follows the list):
1) extracting phonemes and context features from the text with a front-end tool;
2) aligning the text and audio of the training data to obtain the start and end time of each phoneme;
3) converting the structured phoneme representation generated by the front-end tool into a corresponding file, using the same annotation file format.
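The "same annotation file format" is not named in the text; HTS-style full-context labels, one "start end label" line per phoneme with times in 100-ns units, are the usual convention and are assumed in this sketch, which expands each aligned label into 5-ms frame-level feature rows:

# Hedged sketch: aligned phoneme labels -> frame-level feature rows.
FRAME_SHIFT = 50000   # 5 ms expressed in HTS 100-ns time units (assumed)

def label_to_frames(lines, encode):
    """lines: iterable of 'start end full_context_label' strings.
    encode: any function mapping a context label to a feature vector."""
    rows = []
    for line in lines:
        start, end, label = line.split(None, 2)
        n = (int(end) - int(start)) // FRAME_SHIFT   # frames in this phone
        rows.extend([encode(label)] * n)             # repeat per frame
    return rows

demo = ["0 1500000 x^x-sil+a=b", "1500000 2500000 x^sil-a+b=c"]
frames = label_to_frames(demo, encode=lambda lab: [hash(lab) % 7])
print(len(frames))   # 50: 30 frames for 0.15 s plus 20 frames for 0.10 s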
As an improvement, the extraction of the acoustic features comprises the following steps (a vocoder analysis sketch follows the list):
a. reading the spectral envelope information of the speech signal with the vocoder;
b. converting the MFCC features into MGC parameters and extracting the spectral envelope information;
c. extracting aperiodic features of variable dimension, then converting the fundamental frequency features of the speech.
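Steps a-c name only a vocoder, MGC parameters, aperiodicity, and the fundamental frequency. The WORLD vocoder used in the embodiment below has a Python binding, pyworld, and pysptk supplies the spectrum-to-mel-generalized-cepstrum conversion; both bindings, the 60-dimensional MGC order, the all-pass constant 0.58 (typical for 16 kHz), and the file name are assumptions in this sketch:

import numpy as np
import pyworld
import pysptk
import soundfile as sf

x, fs = sf.read("utt0001.wav")                    # hypothetical mono utterance
x = np.ascontiguousarray(x, dtype=np.float64)

f0, t = pyworld.harvest(x, fs, frame_period=5.0)  # step c: fundamental freq
sp = pyworld.cheaptrick(x, f0, t, fs)             # step a: spectral envelope
ap = pyworld.d4c(x, f0, t, fs)                    # step c: aperiodicity

mgc = np.apply_along_axis(pysptk.sp2mc, 1, sp, 59, 0.58)  # step b: 60-dim MGC
bap = pyworld.code_aperiodicity(ap, fs)           # band aperiodicity (BAP)
lf0 = np.full_like(f0, -1e10)                     # unvoiced frames masked
lf0[f0 > 0] = np.log(f0[f0 > 0])                  # log F0 for voiced frames
print(mgc.shape, bap.shape, lf0.shape)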
As an improvement, the regression model generates state-duration features through a state-duration model; the features obtained by combining the state-duration features with the language features are input into the acoustic model to obtain the acoustic features, and the speech is finally synthesized by the vocoder.
As an improvement, the text processing module comprises text data, front-end processing and language feature construction;
the voice processing module comprises voice data and acoustic feature extraction.
As an improvement, the text input module comprises input text data, input text front-end processing and construction of input language features;
the synthesized speech module includes generating acoustic features, a vocoder, and synthesized speech.
Compared with the prior art, the invention has the following advantages: end-to-end speech synthesis based on deep learning yields high naturalness, while the HMM-based method offers good system stability. The front end of the system uses an HMM to obtain the inherent linguistic features of Uygur, and the back-end synthesis part uses a deep neural network framework to build an autoregressive speech synthesis model. The effect is the best of the methods compared: the continuity and stability of the synthesized speech are clearly superior to those of the parametric synthesis method and the end-to-end synthesis method, and the naturalness reaches a satisfactory level.
Drawings
FIG. 1 is a flow chart of a novel Uygur speech synthesis method of the present invention.
FIG. 2 is a system diagram of a Uygur speech synthesis system according to the present invention.
FIG. 3 is a flow chart of the construction of the linguistic features of a novel Uygur speech synthesis system of the present invention.
FIG. 4 is a diagram of feature and parameter prediction for the BiLSTM-based speech synthesis method of the novel Uygur speech synthesis system of the present invention.
FIG. 5 is a spectrogram of original speech of a novel Uygur speech synthesis system of the present invention.
FIG. 6 is a spectrogram of synthesized speech of a novel Uygur speech synthesis system of the present invention.
FIG. 7 is a diagram of a Uygur-Chinese language translation system of a novel Uygur speech synthesis system of the present invention.
Detailed Description
The new Uygur speech synthesis method of the present invention will be further described in detail with reference to the accompanying drawings.
With reference to the drawings, the novel Uygur speech synthesis method comprises the following steps:
(1) forming a recurrent neural network from two recurrent neural networks;
(2) encoding the source language and decoding into the target language with the recurrent networks, the encoder mapping a variable-length input sequence to a fixed-length vector and the decoder mapping that vector representation back to a variable-length target sequence;
(3) reading forward from the start of the text sequence with one RNN while another RNN model reads from the end of the sequence;
(4) extending the memory with a long short-term memory network, using LSTM units as the building blocks of the RNN layers.
A novel Uygur speech synthesis system comprises a training module and a synthesis module;
the training module is used for constructing the language features, extracting the acoustic features, and sending the data to the synthesis module;
the synthesis module is used for receiving input data and the data sent by the training module and synthesizing the speech.
The training module comprises a database, a text processing module and a voice processing module.
The synthesis module comprises a regression model, a text input module and a synthesized voice module.
The construction of the language features comprises the following steps:
A. performing front-end text processing and generating a corresponding label file;
B. encoding the label file and mapping each context label to a feature vector as the input linguistic feature vector of the DNN;
C. performing up-sampling processing to complete the construction of the language features;
D. normalizing the language features with min-max normalization.
Encoding the label file and mapping each context label to a feature vector comprises the following steps:
1) extracting phonemes and context features from the text with a front-end tool;
2) aligning the text and audio of the training data to obtain the start and end time of each phoneme;
3) converting the structured phoneme representation generated by the front-end tool into a corresponding file, using the same annotation file format.
The extraction of the acoustic features comprises the following steps:
a. reading the spectral envelope information of the speech signal with the vocoder;
b. converting the MFCC features into MGC parameters and extracting the spectral envelope information;
c. extracting aperiodic features of variable dimension, then converting the fundamental frequency features of the speech.
The regression model generates state-duration features through a state-duration model; the features obtained by combining the state-duration features with the language features are input into the acoustic model to obtain the acoustic features, and the speech is finally synthesized by the vocoder, as sketched below.
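At synthesis time this chain is just four stages applied in order. A minimal sketch with stand-in callables (the patent fixes the chain, not the model interfaces; every name and dimension here is illustrative):

import numpy as np

def synthesize(ling_feats, duration_model, acoustic_model, vocoder):
    durations = duration_model(ling_feats)       # 1. frames per input unit
    frame_feats = np.repeat(ling_feats, durations, axis=0)  # 2. up-sample
    acoustic = acoustic_model(frame_feats)       # 3. MGC/BAP/log F0 frames
    return vocoder(acoustic)                     # 4. waveform

units = np.random.rand(10, 486)                  # dummy linguistic features
wav = synthesize(
    units,
    duration_model=lambda f: np.full(len(f), 20),     # dummy: 20 frames each
    acoustic_model=lambda f: np.random.rand(len(f), 66),
    vocoder=lambda a: np.zeros(len(a) * 80),          # 5 ms at 16 kHz = 80
)
print(wav.shape)                                 # (16000,)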
The text processing module comprises text data, front-end processing and language feature construction;
the voice processing module comprises voice data and acoustic feature extraction.
The text input module comprises input text data, input text front-end processing and input language feature construction;
the synthesized speech module includes generating acoustic features, a vocoder, and synthesized speech.
The specific embodiments of the invention are as follows:
the first embodiment as shown in fig. 1-6: the method comprises the steps of extracting language feature information by adopting a designed Uygur language front-end text processing module, then carrying out language feature vectorization, acoustic and language feature normalization by a Merlin neural network-based acoustic modeling module, training an acoustic model, synthesizing voice by using a WORLD synthesizer, and building a Uygur language synthesis system based on a neural network.
Tests were carried out with different neural network frameworks, the output features of the networks (MCC, BAP, log F0, and others) were compared, and the synthesized speech was evaluated objectively. The baseline neural network models used in the tests were:
1) Feed-forward deep neural network (DNN): the simplest type of feed-forward network, with multiple hidden layers between the input and output layers.
2) Long short-term memory network (LSTM): a variant of the recurrent neural network that replaces the hidden layer of the RNN with memory cells [65], so that the network learns to store updated information and forget history.
3) Bidirectional long short-term memory network (BiLSTM): formed by combining a forward LSTM with a backward LSTM; both are often used to model context information in natural language processing tasks.
Because neural-network speech synthesis places high demands on corpus size, the corpus was enlarged for this experiment: two years of news texts were collected, the texts were screened, special symbols and unknown words were handled through text normalization, and 7200 sentences were prepared. The recording work was carried out in a television station's live broadcast studio with a station announcer as the speaker. The recording equipment comprised: recording software: Powereditor (infomedia) audio processing software; mixing console: STUDER OnAir 2500; microphone: Vicce Model 309A. Voice file parameters: 48000 Hz at 1536 kbps with 16-bit samples; the speech data used for modeling are 16 kHz, mono. The 7200 sentences and their corresponding sound files served as the training set, with 100 further sentences as the test set. Training was first performed with the DNN-based network. The input to the neural network is a 486-dimensional Uygur linguistic feature vector covering phone context, syllables, words, prosodic phrases, part of speech, and similar information. The output features of the network, 60-dimensional MCC, 5-dimensional BAP, and log F0, are extracted at a 5-millisecond frame interval; the three baseline models are sketched below over these dimensions.
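Given those dimensions, the baselines listed above can be written down directly. A PyTorch sketch (the framework is an assumption: the text builds the system with Merlin; the packing of 60 MCC + 5 BAP + log F0 into one 66-dimensional output vector is likewise inferred, not stated):

import torch
import torch.nn as nn

IN_DIM, OUT_DIM = 486, 66    # 486 linguistic features in; 66 = 60 + 5 + 1

dnn = nn.Sequential(                        # baseline 1: feed-forward DNN
    nn.Linear(IN_DIM, 512), nn.Tanh(),
    nn.Linear(512, 512), nn.Tanh(),
    nn.Linear(512, OUT_DIM),
)

class RecurrentModel(nn.Module):            # baselines 2 and 3: LSTM, BiLSTM
    def __init__(self, bidirectional):
        super().__init__()
        self.rnn = nn.LSTM(IN_DIM, 256, batch_first=True,
                           bidirectional=bidirectional)
        self.out = nn.Linear(512 if bidirectional else 256, OUT_DIM)

    def forward(self, x):
        y, _ = self.rnn(x)
        return self.out(y)

lstm, bilstm = RecurrentModel(False), RecurrentModel(True)
x = torch.randn(1, 200, IN_DIM)             # one utterance, 200 frames
print(dnn(x).shape, lstm(x).shape, bilstm(x).shape)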
In this experiment, the DNN, LSTM, and BiLSTM neural network frameworks were used for training. The parameter settings of the training models are shown in Table 1:
TABLE 1 Parameter settings of the training models
(table reproduced as an image in the original; values not recoverable)
The output features of the networks, such as MCC, BAP, and log F0, are compared, and the synthesized speech is evaluated objectively. Table 2 gives the results of the DNN-based speech synthesis method, Table 3 the results of the LSTM-based method, and Table 4 the results of the BiLSTM-based method:
TABLE 2 Results of the DNN-based speech synthesis method
(table reproduced as an image in the original; values not recoverable)
TABLE 3 Results of the LSTM-based speech synthesis method
(table reproduced as an image in the original; values not recoverable)
TABLE 4 Results of the BiLSTM-based speech synthesis method
(table reproduced as an image in the original; values not recoverable)
The second embodiment, shown in FIGS. 1, 2 and 7: the BiLSTM-based Uygur speech synthesis system has been successfully applied in a Uygur-Chinese speech translation system and improves the naturalness of the speech synthesized by that system. The Uygur-Chinese speech translation system consists mainly of three modules: speech recognition, machine translation, and speech synthesis.
1) The speech recognition module recognizes the input speech signal through a speech recognition system and converts it into text.
2) The machine translation module translates the output text of the speech recognition system through a machine translation system.
3) The speech synthesis module takes the text translated by the machine translation module as input and converts the text content into speech through a speech synthesis system; the whole chain is sketched below.
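The chain reduces to a composition of the three modules. This sketch uses placeholder callables; the patent describes the modules, not a programming interface:

def speech_translate(uyghur_audio, recognize, translate, synthesize):
    uyghur_text = recognize(uyghur_audio)    # module 1: speech recognition
    chinese_text = translate(uyghur_text)    # module 2: machine translation
    return synthesize(chinese_text)          # module 3: BiLSTM-based TTS

wav = speech_translate(
    b"...",                                  # dummy audio bytes
    recognize=lambda a: "recognized Uyghur text",
    translate=lambda t: "translated Chinese text",
    synthesize=lambda t: b"pcm samples",
)
print(wav)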
The Uygur-Chinese speech translation system is aimed at the needs of ordinary users; an Android version of the software has been released for download, interaction takes place through the mobile device, and the interface is simple and self-explanatory. The system has a recognition rate above 95 percent and is speaker-independent: no specific voice needs to be designated, and it can recognize the Uyghur speech of any speaker. The translation part translates quickly through an end-to-end neural network. The speech synthesis part converts the input text into speech through the BiLSTM-based neural network system, and the naturalness and clarity of the synthesized speech reach a satisfactory level. Embedding the Uygur speech synthesis system studied here into the speech translation system raises its application value.
The present invention and its embodiments have been described above without limitation; the drawings show only one embodiment, and the actual structure is not limited to it. In summary, those skilled in the art should appreciate that they can readily use the disclosed conception and specific embodiments as a basis for designing or modifying other structures for carrying out the same purposes of the present invention without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A novel Uygur voice synthesis method is characterized in that: the method comprises the following steps:
(1) forming a recurrent neural network by using two recurrent neural networks;
(2) encoding in a source language and decoding in a target language using a recurrent neural network, the encoder mapping variable length linear sequences to fixed length vectors and the decoder mapping vector representations to variable length target sequences;
(3) reading forward from the starting point of the text sequence by using the RNN, and reading from the end point of the text sequence by using another RNN model;
(4) expanding the memory with a long short-term memory network and using LSTM units as the building blocks of an RNN layer.
2. The novel Uygur speech synthesis system as claimed in claim 1, wherein: comprises a training module and a synthesis module;
the training module is used for constructing language features and extracting acoustic features, and sending data to the synthesis part;
the synthesis module is used for inputting data and receiving the data sent by the training module to synthesize voice.
3. The novel Uygur speech synthesis system as claimed in claim 2, wherein: the training module comprises a database, a text processing module and a voice processing module.
4. The novel Uygur speech synthesis system as claimed in claim 2, wherein: the synthesis module comprises a regression model, a text input module and a synthesized voice module.
5. The novel Uygur speech synthesis system as claimed in claim 2, wherein: the construction of the language features comprises the following steps:
A. performing front-end text processing and generating a corresponding label file;
B. encoding the label file and mapping each context label to a feature vector as the input linguistic feature vector of the DNN;
C. Performing up-sampling processing to construct and complete language features;
D. normalizing the language features with min-max normalization.
6. The novel Uygur speech synthesis system as claimed in claim 5, wherein: the encoding processing of the markup file and the mapping of each context tag to a feature vector comprise the following steps:
1) extracting phonemes and context features from the text using a front-end tool;
2) aligning the text and the audio of the training data to obtain the start time and the end time of each phoneme;
3) converting the structured phoneme representation generated by the front-end tool into a corresponding file, using the same annotation file format.
7. The novel Uygur speech synthesis system as claimed in claim 2, wherein: the extraction of the acoustic features comprises the following steps:
a. reading spectral envelope information of a speech signal using a vocoder;
b. converting the MFCC characteristics into MGC parameters, and extracting spectral envelope information;
c. extracting aperiodic features of variable dimension, then converting the fundamental frequency features of the speech.
8. The novel Uygur speech synthesis system according to any one of claims 2 to 4, wherein: the regression model generates state-duration features through a state-duration model; the features obtained by combining the state-duration features with the language features are input into the acoustic model to obtain the acoustic features, and the speech is finally synthesized by the vocoder.
9. The novel Uygur speech synthesis system as claimed in claim 3, wherein: the text processing module comprises text data, front-end processing and language feature construction;
the voice processing module comprises voice data and acoustic feature extraction.
10. The novel Uygur speech synthesis system as claimed in claim 4, wherein: the text input module comprises input text data, input text front-end processing and input language feature construction;
the synthesized speech module includes generating acoustic features, a vocoder, and synthesized speech.
CN202110180854.2A 2021-02-09 2021-02-09 Novel Uygur voice synthesis method Withdrawn CN112927673A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110180854.2A CN112927673A (en) 2021-02-09 2021-02-09 Novel Uygur voice synthesis method


Publications (1)

Publication Number Publication Date
CN112927673A (en) 2021-06-08

Family

ID=76171449

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110180854.2A Withdrawn CN112927673A (en) 2021-02-09 2021-02-09 Novel Uygur voice synthesis method

Country Status (1)

Country Link
CN (1) CN112927673A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210608