CN112927673A - Novel Uygur voice synthesis method - Google Patents

Novel Uygur voice synthesis method

Info

Publication number
CN112927673A
Authority
CN
China
Prior art keywords: uygur, text, novel, speech, language
Legal status
Withdrawn
Application number
CN202110180854.2A
Other languages
Chinese (zh)
Inventor
帕丽旦·木合塔尔
买买提阿依甫
Current Assignee
Xinjiang University Of Finance & Economics
Original Assignee
Xinjiang University Of Finance & Economics
Application filed by Xinjiang University Of Finance & Economics
Priority to CN202110180854.2A
Publication of CN112927673A
Withdrawn

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/027 - Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The invention discloses a novel Uygur speech synthesis method. End-to-end speech synthesis based on deep learning yields high naturalness, while the HMM-based method offers good system stability. The front end of the system therefore uses an HMM to obtain the inherent linguistic features of Uygur, and the back-end synthesis part uses a deep neural network framework to build an autoregressive speech synthesis model. The continuity and stability of the synthesized speech are clearly superior to those of the parametric synthesis method and the end-to-end synthesis method, and the naturalness reaches a satisfactory level.

Description

Novel Uygur voice synthesis method
Technical Field
The invention relates to the field of artificial intelligence, in particular to a novel Uygur voice synthesis method.
Background
An artificial neural network is a mathematical model built by simulating the biological nervous system. It reflects, to some extent, basic characteristics of the brain as a biological system and is a network structure modeled on biological neural processes. In 2012, Hinton et al. successfully applied deep learning to speech recognition and greatly improved the recognition rate, which gave rise to speech synthesis methods based on neural networks.
Among the neural-network-based speech synthesis methods, the most commonly used are the deep neural network (DNN) based method, the recurrent neural network (RNN) based method, and the long short-term memory (LSTM) network based method. The bidirectional RNN proposed by Schuster acts as a sequence learner that can encode the context of the current frame, generating a bidirectional sequence and modeling it through sequence learning. To address the vanishing-gradient problem of conventional RNNs, Hochreiter et al. proposed the long short-term memory (LSTM) structure. A more recent popular approach employs deep belief networks (DBNs) to model speech and acoustic features jointly. Acoustic features have also been predicted with real-valued neural autoregressive density estimators and deep mixture density networks. A deep feed-forward neural network can be viewed as a replacement for the decision trees in HMM-based speech synthesis. Recurrent neural networks express TTS as a sequence-to-sequence mapping problem. Under added context constraints, long short-term memory networks combine the RNN-based gated recurrent unit (GRU) with a mixture density model to predict the sequence of probability density functions. The technical problems reflected above are therefore problems to be solved by those skilled in the art.
Disclosure of Invention
The technical problem to be solved by the invention is to overcome the defects of the prior art and provide a novel Uygur speech synthesis method.
In order to solve the above technical problems, the invention provides a novel Uygur speech synthesis method comprising the following steps (a minimal sketch of the resulting structure follows the list):
(1) forming a recurrent neural network from two recurrent neural networks;
(2) encoding the source language and decoding into the target language with the recurrent networks, the encoder mapping a variable-length input sequence to a fixed-length vector and the decoder mapping that vector representation back to a variable-length target sequence;
(3) reading forward from the start of the text sequence with one RNN while another RNN model reads from the end of the sequence;
(4) extending the memory with a long short-term memory network, using LSTM units as the building blocks of the RNN layers.
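The patent fixes no software framework for these steps; the following minimal PyTorch sketch (the framework choice, layer sizes, and dummy dimensions are all assumptions) illustrates steps (1)-(4): one LSTM reads the input from the start and one from the end, the encoder packs the sequence into a fixed-length vector, and a decoder unrolls that vector into a variable-length output sequence.

# Minimal encoder-decoder sketch for steps (1)-(4). Dimensions are
# illustrative, not taken from the patent.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, in_dim=486, hid_dim=256):
        super().__init__()
        # bidirectional=True realizes steps (3)-(4): one LSTM reads from
        # the start of the sequence, the other from the end
        self.rnn = nn.LSTM(in_dim, hid_dim, batch_first=True,
                           bidirectional=True)

    def forward(self, x):                       # x: (batch, time, in_dim)
        _, (h, _) = self.rnn(x)
        # concatenate final forward and backward states into the
        # fixed-length vector of step (2)
        return torch.cat([h[0], h[1]], dim=-1)  # (batch, 2 * hid_dim)

class Decoder(nn.Module):
    def __init__(self, ctx_dim=512, out_dim=66, hid_dim=256):
        super().__init__()
        self.rnn = nn.LSTM(ctx_dim, hid_dim, batch_first=True)
        self.proj = nn.Linear(hid_dim, out_dim)

    def forward(self, context, steps):
        # repeat the fixed-length context vector at every output step,
        # then unroll it into a variable-length target sequence
        x = context.unsqueeze(1).repeat(1, steps, 1)
        y, _ = self.rnn(x)
        return self.proj(y)                     # (batch, steps, out_dim)

enc, dec = Encoder(), Decoder()
feats = torch.randn(2, 100, 486)                # dummy linguistic features
out = dec(enc(feats), steps=120)                # dummy acoustic trajectory
print(out.shape)                                # torch.Size([2, 120, 66])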
A novel Uygur voice synthesis system comprises a training module and a synthesis module;
the training module is used for constructing the language features, extracting the acoustic features, and sending the data to the synthesis module;
the synthesis module is used for receiving input data and the data sent by the training module and synthesizing the speech.
As an improvement, the training module comprises a database, a text processing module and a voice processing module.
As an improvement, the synthesis module comprises a regression model, a text input module and a synthesized voice module.
As an improvement, the construction of the language features comprises the following steps (a sketch of the normalization step follows the list):
A. performing front-end text processing and generating a corresponding label file;
B. encoding the label file and mapping each context label to a feature vector as the input linguistic feature vector of the DNN;
C. performing up-sampling processing to complete the construction of the language features;
D. normalizing the language features with min-max normalization.
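Step D is plain per-dimension min-max scaling. A minimal NumPy sketch, assuming a (0.01, 0.99) target range (the range is borrowed from common practice in tools such as Merlin; the patent specifies only that min-max normalization is used):

import numpy as np

def minmax_normalize(feats, lo=0.01, hi=0.99):
    # scale every feature dimension into [lo, hi] (step D)
    fmin = feats.min(axis=0)
    fmax = feats.max(axis=0)
    span = np.where(fmax > fmin, fmax - fmin, 1.0)  # guard constant dims
    return lo + (hi - lo) * (feats - fmin) / span

frames = np.random.rand(1000, 486)   # dummy up-sampled feature matrix
norm = minmax_normalize(frames)
print(norm.min(), norm.max())        # ~0.01 and ~0.99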
As an improvement, encoding the label file and mapping each context label to a feature vector comprises the following steps (a label-to-frame sketch follows the list):
1) extracting phonemes and context features from the text with a front-end tool;
2) aligning the text and audio of the training data to obtain the start and end time of each phoneme;
3) converting the structured phoneme representation generated by the front-end tool into a corresponding file, using the same annotation file format.
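The "same annotation file format" is not named in the text; HTS-style full-context labels, one "start end label" line per phoneme with times in 100-ns units, are the usual convention and are assumed in this sketch, which expands each aligned label into 5-ms frame-level feature rows:

# Hedged sketch: aligned phoneme labels -> frame-level feature rows.
FRAME_SHIFT = 50000   # 5 ms expressed in HTS 100-ns time units (assumed)

def label_to_frames(lines, encode):
    """lines: iterable of 'start end full_context_label' strings.
    encode: any function mapping a context label to a feature vector."""
    rows = []
    for line in lines:
        start, end, label = line.split(None, 2)
        n = (int(end) - int(start)) // FRAME_SHIFT   # frames in this phone
        rows.extend([encode(label)] * n)             # repeat per frame
    return rows

demo = ["0 1500000 x^x-sil+a=b", "1500000 2500000 x^sil-a+b=c"]
frames = label_to_frames(demo, encode=lambda lab: [hash(lab) % 7])
print(len(frames))   # 50: 30 frames for 0.15 s plus 20 frames for 0.10 s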
As an improvement, the extraction of the acoustic features comprises the following steps (a vocoder analysis sketch follows the list):
a. reading the spectral envelope information of the speech signal with the vocoder;
b. converting the MFCC features into MGC parameters and extracting the spectral envelope information;
c. extracting aperiodic features of variable dimension, then converting the fundamental frequency features of the speech.
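Steps a-c name only a vocoder, MGC parameters, aperiodicity, and the fundamental frequency. The WORLD vocoder used in the embodiment below has a Python binding, pyworld, and pysptk supplies the spectrum-to-mel-generalized-cepstrum conversion; both bindings, the 60-dimensional MGC order, the all-pass constant 0.58 (typical for 16 kHz), and the file name are assumptions in this sketch:

import numpy as np
import pyworld
import pysptk
import soundfile as sf

x, fs = sf.read("utt0001.wav")                    # hypothetical mono utterance
x = np.ascontiguousarray(x, dtype=np.float64)

f0, t = pyworld.harvest(x, fs, frame_period=5.0)  # step c: fundamental freq
sp = pyworld.cheaptrick(x, f0, t, fs)             # step a: spectral envelope
ap = pyworld.d4c(x, f0, t, fs)                    # step c: aperiodicity

mgc = np.apply_along_axis(pysptk.sp2mc, 1, sp, 59, 0.58)  # step b: 60-dim MGC
bap = pyworld.code_aperiodicity(ap, fs)           # band aperiodicity (BAP)
lf0 = np.full_like(f0, -1e10)                     # unvoiced frames masked
lf0[f0 > 0] = np.log(f0[f0 > 0])                  # log F0 for voiced frames
print(mgc.shape, bap.shape, lf0.shape)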
As an improvement, the regression model generates state-duration features through a state-duration model; the features obtained by combining the state-duration features with the language features are input into the acoustic model to obtain the acoustic features, and the speech is finally synthesized by the vocoder.
As an improvement, the text processing module comprises text data, front-end processing and language feature construction;
the voice processing module comprises voice data and acoustic feature extraction.
As an improvement, the text input module comprises input text data, input text front-end processing and construction of input language features;
the synthesized speech module includes generating acoustic features, a vocoder, and synthesized speech.
Compared with the prior art, the invention has the following advantages: end-to-end speech synthesis based on deep learning yields high naturalness, while the HMM-based method offers good system stability. The front end of the system uses an HMM to obtain the inherent linguistic features of Uygur, and the back-end synthesis part uses a deep neural network framework to build an autoregressive speech synthesis model. The effect is the best of the methods compared: the continuity and stability of the synthesized speech are clearly superior to those of the parametric synthesis method and the end-to-end synthesis method, and the naturalness reaches a satisfactory level.
Drawings
FIG. 1 is a flow chart of a novel Uygur speech synthesis method of the present invention.
FIG. 2 is a system diagram of a Uygur speech synthesis system according to the present invention.
FIG. 3 is a flow chart of the construction of the linguistic features of a novel Uygur speech synthesis system of the present invention.
FIG. 4 is a diagram of feature and parameter prediction for the BiLSTM-based speech synthesis method of the novel Uygur speech synthesis system of the present invention.
FIG. 5 is a spectrogram of original speech of a novel Uygur speech synthesis system of the present invention.
FIG. 6 is a spectrogram of synthesized speech of a novel Uygur speech synthesis system of the present invention.
FIG. 7 is a diagram of a Uygur-Chinese language translation system of a novel Uygur speech synthesis system of the present invention.
Detailed Description
The new Uygur speech synthesis method of the present invention will be further described in detail with reference to the accompanying drawings.
With reference to the drawings, the novel Uygur speech synthesis method comprises the following steps:
(1) forming a recurrent neural network from two recurrent neural networks;
(2) encoding the source language and decoding into the target language with the recurrent networks, the encoder mapping a variable-length input sequence to a fixed-length vector and the decoder mapping that vector representation back to a variable-length target sequence;
(3) reading forward from the start of the text sequence with one RNN while another RNN model reads from the end of the sequence;
(4) extending the memory with a long short-term memory network, using LSTM units as the building blocks of the RNN layers.
A novel Uygur speech synthesis system comprises a training module and a synthesis module;
the training module is used for constructing the language features, extracting the acoustic features, and sending the data to the synthesis module;
the synthesis module is used for receiving input data and the data sent by the training module and synthesizing the speech.
The training module comprises a database, a text processing module and a voice processing module.
The synthesis module comprises a regression model, a text input module and a synthesized voice module.
The construction of the language features comprises the following steps:
A. performing front-end text processing and generating a corresponding label file;
B. encoding the label file and mapping each context label to a feature vector as the input linguistic feature vector of the DNN;
C. performing up-sampling processing to complete the construction of the language features;
D. normalizing the language features with min-max normalization.
Encoding the label file and mapping each context label to a feature vector comprises the following steps:
1) extracting phonemes and context features from the text with a front-end tool;
2) aligning the text and audio of the training data to obtain the start and end time of each phoneme;
3) converting the structured phoneme representation generated by the front-end tool into a corresponding file, using the same annotation file format.
The extraction of the acoustic features comprises the following steps:
a. reading the spectral envelope information of the speech signal with the vocoder;
b. converting the MFCC features into MGC parameters and extracting the spectral envelope information;
c. extracting aperiodic features of variable dimension, then converting the fundamental frequency features of the speech.
The regression model generates state-duration features through a state-duration model; the features obtained by combining the state-duration features with the language features are input into the acoustic model to obtain the acoustic features, and the speech is finally synthesized by the vocoder, as sketched below.
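At synthesis time this chain is just four stages applied in order. A minimal sketch with stand-in callables (the patent fixes the chain, not the model interfaces; every name and dimension here is illustrative):

import numpy as np

def synthesize(ling_feats, duration_model, acoustic_model, vocoder):
    durations = duration_model(ling_feats)       # 1. frames per input unit
    frame_feats = np.repeat(ling_feats, durations, axis=0)  # 2. up-sample
    acoustic = acoustic_model(frame_feats)       # 3. MGC/BAP/log F0 frames
    return vocoder(acoustic)                     # 4. waveform

units = np.random.rand(10, 486)                  # dummy linguistic features
wav = synthesize(
    units,
    duration_model=lambda f: np.full(len(f), 20),     # dummy: 20 frames each
    acoustic_model=lambda f: np.random.rand(len(f), 66),
    vocoder=lambda a: np.zeros(len(a) * 80),          # 5 ms at 16 kHz = 80
)
print(wav.shape)                                 # (16000,)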
The text processing module comprises text data, front-end processing and language feature construction;
the voice processing module comprises voice data and acoustic feature extraction.
The text input module comprises input text data, input text front-end processing and input language feature construction;
the synthesized speech module includes generating acoustic features, a vocoder, and synthesized speech.
The specific embodiments of the invention are as follows:
the first embodiment as shown in fig. 1-6: the method comprises the steps of extracting language feature information by adopting a designed Uygur language front-end text processing module, then carrying out language feature vectorization, acoustic and language feature normalization by a Merlin neural network-based acoustic modeling module, training an acoustic model, synthesizing voice by using a WORLD synthesizer, and building a Uygur language synthesis system based on a neural network.
Tests were carried out with different neural network frameworks, the output features of the networks (MCC, BAP, log F0, and others) were compared, and the synthesized speech was evaluated objectively. The baseline neural network models used in the tests were:
1) Feed-forward deep neural network (DNN): the simplest type of feed-forward network, with multiple hidden layers between the input and output layers.
2) Long short-term memory network (LSTM): a variant of the recurrent neural network that replaces the hidden layer of the RNN with memory cells [65], so that the network learns to store updated information and forget history.
3) Bidirectional long short-term memory network (BiLSTM): formed by combining a forward LSTM with a backward LSTM; both are often used to model context information in natural language processing tasks.
Because neural-network speech synthesis places high demands on corpus size, the corpus was enlarged for this experiment: two years of news texts were collected, the texts were screened, special symbols and unknown words were handled through text normalization, and 7200 sentences were prepared. The recording work was carried out in a television station's live broadcast studio with a station announcer as the speaker. The recording equipment comprised: recording software: Powereditor (infomedia) audio processing software; mixing console: STUDER OnAir 2500; microphone: Vicce Model 309A. Voice file parameters: 48000 Hz at 1536 kbps with 16-bit samples; the speech data used for modeling are 16 kHz, mono. The 7200 sentences and their corresponding sound files served as the training set, with 100 further sentences as the test set. Training was first performed with the DNN-based network. The input to the neural network is a 486-dimensional Uygur linguistic feature vector covering phone context, syllables, words, prosodic phrases, part of speech, and similar information. The output features of the network, 60-dimensional MCC, 5-dimensional BAP, and log F0, are extracted at a 5-millisecond frame interval; the three baseline models are sketched below over these dimensions.
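Given those dimensions, the baselines listed above can be written down directly. A PyTorch sketch (the framework is an assumption: the text builds the system with Merlin; the packing of 60 MCC + 5 BAP + log F0 into one 66-dimensional output vector is likewise inferred, not stated):

import torch
import torch.nn as nn

IN_DIM, OUT_DIM = 486, 66    # 486 linguistic features in; 66 = 60 + 5 + 1

dnn = nn.Sequential(                        # baseline 1: feed-forward DNN
    nn.Linear(IN_DIM, 512), nn.Tanh(),
    nn.Linear(512, 512), nn.Tanh(),
    nn.Linear(512, OUT_DIM),
)

class RecurrentModel(nn.Module):            # baselines 2 and 3: LSTM, BiLSTM
    def __init__(self, bidirectional):
        super().__init__()
        self.rnn = nn.LSTM(IN_DIM, 256, batch_first=True,
                           bidirectional=bidirectional)
        self.out = nn.Linear(512 if bidirectional else 256, OUT_DIM)

    def forward(self, x):
        y, _ = self.rnn(x)
        return self.out(y)

lstm, bilstm = RecurrentModel(False), RecurrentModel(True)
x = torch.randn(1, 200, IN_DIM)             # one utterance, 200 frames
print(dnn(x).shape, lstm(x).shape, bilstm(x).shape)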
In this experiment, the DNN, LSTM, and BiLSTM neural network frameworks were used for training. The parameter settings of the training models are shown in Table 1:
TABLE 1 Parameter settings of the training models
(table reproduced as an image in the original; values not recoverable)
The output features of the networks, such as MCC, BAP, and log F0, are compared, and the synthesized speech is evaluated objectively. Table 2 gives the results of the DNN-based speech synthesis method, Table 3 the results of the LSTM-based method, and Table 4 the results of the BiLSTM-based method:
TABLE 2 Results of the DNN-based speech synthesis method
(table reproduced as an image in the original; values not recoverable)
TABLE 3 Results of the LSTM-based speech synthesis method
(table reproduced as an image in the original; values not recoverable)
TABLE 4 Results of the BiLSTM-based speech synthesis method
(table reproduced as an image in the original; values not recoverable)
The second embodiment, shown in FIGS. 1, 2 and 7: the BiLSTM-based Uygur speech synthesis system has been successfully applied in a Uygur-Chinese speech translation system and improves the naturalness of the speech synthesized by that system. The Uygur-Chinese speech translation system consists mainly of three modules: speech recognition, machine translation, and speech synthesis.
1) The speech recognition module recognizes the input speech signal through a speech recognition system and converts it into text.
2) The machine translation module translates the output text of the speech recognition system through a machine translation system.
3) The speech synthesis module takes the text translated by the machine translation module as input and converts the text content into speech through a speech synthesis system; the whole chain is sketched below.
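The chain reduces to a composition of the three modules. This sketch uses placeholder callables; the patent describes the modules, not a programming interface:

def speech_translate(uyghur_audio, recognize, translate, synthesize):
    uyghur_text = recognize(uyghur_audio)    # module 1: speech recognition
    chinese_text = translate(uyghur_text)    # module 2: machine translation
    return synthesize(chinese_text)          # module 3: BiLSTM-based TTS

wav = speech_translate(
    b"...",                                  # dummy audio bytes
    recognize=lambda a: "recognized Uyghur text",
    translate=lambda t: "translated Chinese text",
    synthesize=lambda t: b"pcm samples",
)
print(wav)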
The Uygur-Chinese speech translation system is aimed at the needs of ordinary users; an Android version of the software has been released for download, interaction takes place through the mobile device, and the interface is simple and self-explanatory. The system has a recognition rate above 95 percent and is speaker-independent: no specific voice needs to be designated, and it can recognize the Uyghur speech of any speaker. The translation part translates quickly through an end-to-end neural network. The speech synthesis part converts the input text into speech through the BiLSTM-based neural network system, and the naturalness and clarity of the synthesized speech reach a satisfactory level. Embedding the Uygur speech synthesis system studied here into the speech translation system raises its application value.
The present invention and its embodiments have been described above without limitation; the drawings show only one embodiment, and the actual structure is not limited to it. In summary, those skilled in the art should appreciate that they can readily use the disclosed conception and specific embodiments as a basis for designing or modifying other structures for carrying out the same purposes of the present invention without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A novel Uygur voice synthesis method is characterized in that: the method comprises the following steps:
(1) forming a recurrent neural network by using two recurrent neural networks;
(2) encoding in a source language and decoding in a target language using a recurrent neural network, the encoder mapping variable length linear sequences to fixed length vectors and the decoder mapping vector representations to variable length target sequences;
(3) reading forward from the starting point of the text sequence by using the RNN, and reading from the end point of the text sequence by using another RNN model;
(4) expanding the memory with a long short-term memory network and using LSTM units as the building blocks of an RNN layer.
2. The novel Uygur speech synthesis system as claimed in claim 1, wherein: comprises a training module and a synthesis module;
the training module is used for constructing language features and extracting acoustic features, and sending data to the synthesis part;
the synthesis module is used for inputting data and receiving the data sent by the training module to synthesize voice.
3. The novel Uygur speech synthesis system as claimed in claim 2, wherein: the training module comprises a database, a text processing module and a voice processing module.
4. The novel Uygur speech synthesis system as claimed in claim 2, wherein: the synthesis module comprises a regression model, a text input module and a synthesized voice module.
5. The novel Uygur speech synthesis system as claimed in claim 2, wherein: the construction of the language features comprises the following steps:
A. performing front-end text processing and generating a corresponding label file;
B. encoding the label file and mapping each context label to a feature vector as the input linguistic feature vector of the DNN;
C. Performing up-sampling processing to construct and complete language features;
D. normalizing the language features with min-max normalization.
6. The novel Uygur speech synthesis system as claimed in claim 5, wherein: the encoding processing of the markup file and the mapping of each context tag to a feature vector comprise the following steps:
1) extracting phonemes and context features from the text using a front-end tool;
2) aligning the text and the audio of the training data to obtain the start time and the end time of each phoneme;
3) converting the structured phoneme representation generated by the front-end tool into a corresponding file, using the same annotation file format.
7. The novel Uygur speech synthesis system as claimed in claim 2, wherein: the extraction of the acoustic features comprises the following steps:
a. reading spectral envelope information of a speech signal using a vocoder;
b. converting the MFCC characteristics into MGC parameters, and extracting spectral envelope information;
c. extracting aperiodic features of variable dimension, then converting the fundamental frequency features of the speech.
8. The novel Uygur speech synthesis system according to any one of claims 2 to 4, wherein: the regression model generates state-duration features through a state-duration model; the features obtained by combining the state-duration features with the language features are input into the acoustic model to obtain the acoustic features, and the speech is finally synthesized by the vocoder.
9. The novel Uygur speech synthesis system as claimed in claim 3, wherein: the text processing module comprises text data, front-end processing and language feature construction;
the voice processing module comprises voice data and acoustic feature extraction.
10. The novel Uygur speech synthesis system as claimed in claim 4, wherein: the text input module comprises input text data, input text front-end processing and input language feature construction;
the synthesized speech module includes generating acoustic features, a vocoder, and synthesized speech.
CN202110180854.2A 2021-02-09 2021-02-09 Novel Uygur voice synthesis method Withdrawn CN112927673A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110180854.2A CN112927673A (en) 2021-02-09 2021-02-09 Novel Uygur voice synthesis method


Publications (1)

Publication Number Publication Date
CN112927673A (en) 2021-06-08

Family

ID=76171449

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110180854.2A Withdrawn CN112927673A (en) 2021-02-09 2021-02-09 Novel Uygur voice synthesis method

Country Status (1)

Country Link
CN (1) CN112927673A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210608