WO2019056500A1

WO2019056500A1 - Electronic apparatus, speech synthesis method, and computer readable storage medium

Info

Publication number: WO2019056500A1
Application number: PCT/CN2017/108766
Authority: WO
Inventors: 梁浩; 程宁; 王健宗; 肖京
Original assignee: 平安科技（深圳）有限公司
Priority date: 2017-09-25
Filing date: 2017-10-31
Publication date: 2019-03-28
Also published as: CN107564511A; CN107564511B

Abstract

Disclosed in the present application are an electronic apparatus, speech synthesis method, and storage medium. The method comprises: upon receiving a text to be synthesized, dividing sentences and phrases of the text to be synthesized into words, determining, according to a predetermined mapping relationship between words, pronunciation durations, and pronunciation fundamental frequencies, a pronunciation duration and pronunciation fundamental frequency corresponding to each of the words, and categorizing, according to a predetermined pronunciation dictionary, respective words into predetermined speech feature types; extracting, according to the speech feature and pronunciation duration of each word, a predetermined type of acoustic feature vector corresponding to the text to be synthesized; inputting, into a trained predetermined type identification model, the predetermined type of acoustic feature vector corresponding to the text to be synthesized , and identifying a voiceprint feature of the text to be synthesized; and generating, according to the identified voiceprint feature and the pronunciation fundamental frequencies of the words, speech corresponding to the text to be synthesized. The technical solution in the present application enables highly accurate, natural, and clear speech synthesis results.

Description

Electronic device, speech synthesis method, and computer readable storage medium

The present application is based on the priority of the Chinese Patent Application entitled "Electronic Device, Speech Synthesis Method and Computer Readable Storage Media", filed on September 25, 2017, with the application number of CN 201710874876. The content is incorporated herein by reference.

Technical field

The present application relates to the field of voice technologies, and in particular, to an electronic device, a voice synthesis method, and a computer readable storage medium.

Background technique

Speech synthesis technology, also known as text to speech (speech synthesis, TTS), aims to make the text information into artificial speech output through recognition and understanding, which is an important branch of modern artificial intelligence development. . Speech synthesis can play a great role in quality detection, machine question and answer, disability assistance and other fields, which is convenient for people's life. The naturalness and clarity of speech synthesis directly determine the effectiveness of technical application. At present, the existing speech synthesis scheme usually uses traditional mixed Gaussian technology to construct speech units. However, speech synthesis is basically to complete a modeling mapping from morpheme (linguistic space) to phoneme (acoustic space). A complex nonlinear mode mapping, using traditional hybrid Gaussian technology can not achieve high-precision, high-depth feature mining and expression, easy to make mistakes

Summary of the invention

The present application provides an electronic device, a speech synthesis method, and a computer readable storage medium, which are intended to have high precision, naturalness, and clarity of speech synthesis results.

A first aspect of the present application provides an electronic device, including a memory, a processor, and a memory synthesis system executable on the processor, where the voice synthesis system is implemented by the processor step:

After receiving the text to be synthesized for speech synthesis, the sentence and the phrase in the text to be synthesized are split into single words, and each mapping is determined according to a mapping relationship between the predetermined word, the length of the pronunciation, and the fundamental frequency of the pronunciation. The pronunciation duration and the pronunciation fundamental frequency of the single word are divided into preset type speech features according to a predetermined pronunciation dictionary, and the speech features of the individual words corresponding to the text to be synthesized are determined;

Extracting a preset type acoustic feature vector corresponding to the to-be-synthesized text according to a voice feature and a pronunciation duration of each single word corresponding to the text to be synthesized;

Inputting the preset type acoustic feature vector corresponding to the text to be synthesized into the trained preset type recognition model, and identifying the voiceprint feature corresponding to the text to be synthesized;

Generating according to the voiceprint feature corresponding to the text to be synthesized and the pronunciation fundamental frequency of each word The voice corresponding to the text to be synthesized.

A second aspect of the present application provides an automatic synthesized speech method, the method comprising the steps of:

The voice corresponding to the text to be synthesized is generated according to the voiceprint feature corresponding to the text to be synthesized and the pronunciation fundamental frequency of each single word.

A third aspect of the present application provides a computer readable storage medium storing a speech synthesis system, the speech synthesis system being executable by at least one processor to cause the at least one processor to execute as follows step:

The technical solution of the present application first splits the phrases and sentences in the text to be synthesized into single words, and determines the pronunciation fundamental frequency, pronunciation duration and speech features corresponding to each single word; then, according to the speech features and pronunciations of the individual words corresponding to the text to be synthesized The preset type acoustic feature vector corresponding to the text to be synthesized is extracted; and the extracted preset type acoustic feature vector is identified by the trained preset type recognition model, thereby identifying the sound corresponding to the text to be synthesized The pattern is obtained; finally, according to the voiceprint feature corresponding to the text to be synthesized and the pronunciation fundamental frequency of each word, the voice corresponding to the text to be synthesized is generated. Compared with the prior art method of constructing a speech unit by using the traditional mixed Gaussian technology, the present invention identifies a voiceprint feature corresponding to the text to be synthesized by using a trained preset type recognition model, and the preset type recognition model is a large amount of data in advance. Has been trained to be completed, therefore, the accuracy of the voiceprint feature corresponding to the recognized text to be synthesized is high, and further, according to the voiceprint corresponding to the text to be synthesized The feature and the pronunciation base frequency of each single word, the generated speech corresponding to the text to be synthesized, the naturalness and the definition are better, and are not easy to make mistakes.

DRAWINGS

1 is a schematic flow chart of a preferred embodiment of a speech synthesis method according to the present application;

2 is a schematic flowchart of a training process of a preset type recognition model in a preferred embodiment of the speech synthesis method of the present application;

3 is a schematic diagram of an operating environment of a preferred embodiment of a speech synthesis system of the present application;

4 is a block diagram of a program of a preferred embodiment of a speech synthesis system of the present application.

Detailed ways

The principles and features of the present application are described in the following with reference to the accompanying drawings, which are only used to explain the present application and are not intended to limit the scope of the application.

As shown in FIG. 1, FIG. 1 is a schematic flowchart of a voice synthesizing method according to a preferred embodiment of the present application.

In this embodiment, the voice synthesis method includes:

Step S10, after receiving the text to be synthesized for synthesizing the speech, splitting the sentence and the phrase in the text to be synthesized into a single word, according to a mapping relationship between the predetermined word, the length of the pronunciation, and the fundamental frequency of the pronunciation. Determining a pronunciation duration and a pronunciation fundamental frequency corresponding to each single word, and dividing each single word into a preset type speech feature according to a predetermined pronunciation dictionary, and determining a speech feature of each single word corresponding to the to-be-synthesized text;

Pronunciation fundamental frequency: Sometimes it can also be called pitch, which refers to the fundamental frequency of pronunciation. When the sounding body makes a sound due to vibration, the sound emitted can be decomposed into many simple sine waves, that is, all natural. The sound is basically composed of many sine waves with different frequencies, and the lowest frequency sine wave is the fundamental frequency. Phoneme: refers to the smallest phonetic unit based on the natural attributes of speech. From the perspective of acoustic properties, phoneme is the smallest unit of speech divided from the perspective of sound quality. From the physiological point of view, a pronunciation action forms a phoneme, such as [ Ma] contains two sounding actions [m] and [a]. The two phonemes are the same phoneme. The sounds emitted by different pronunciation actions are different phonemes, such as [ma-mi], two [m] The pronunciation is the same, it is the same phoneme, and [a][i] has different pronunciations and is different phonemes. For example, "Mandarin" consists of three syllables "pu, tong, hua", which can be analyzed into "p, u, t, o, ng, h, u, a" eight phonemes. In this embodiment, the pronunciation fundamental frequency and the pronunciation duration (ie, the sound length) of the single word may be determined by a pre-trained model, such as by a pre-trained Hidden Markov Model (HMM); the preset type Speech features, for example, may include syllables, phonemes, initials, and finals. After receiving the text to be synthesized for speech synthesis, the speech synthesis system splits the text sentence and the phrase in the text to be synthesized, and splits into a plurality of single words; the system has a predetermined pronunciation dictionary ( example For example, a Mandarin pronunciation dictionary, a Cantonese pronunciation dictionary, etc., and a mapping table between a predetermined word, a pronunciation duration, and a pronunciation fundamental frequency, the speech synthesis system splits the sentences and phrases in the text to be synthesized into single words. By searching the mapping table, the pronunciation duration and pronunciation audio corresponding to each single word can be found, and each word can be further divided into preset type speech features according to the predetermined pronunciation dictionary, thereby obtaining the corresponding text to be synthesized. The phonetic features of each word.

In step S20, the preset type acoustic feature vector corresponding to the text to be synthesized is extracted according to the voice features and the pronunciation duration of each word corresponding to the text to be synthesized;

For example, the preset type acoustic feature vector is an acoustic and linguistic feature vector, and the preset type acoustic feature vector includes the acoustic and linguistic feature vectors in Table 1 below, including: factor type, length, pitch , accent position, lip shape, finals | consonant type, pronunciation part, finals | consonant pronunciation, and whether accent, syllable position, position of phoneme in syllable, position of syllable in word.

Table 1 Example of acoustic feature vector

In step S30, the preset type acoustic feature vector corresponding to the text to be synthesized is input into the trained preset type recognition model, and the voiceprint feature corresponding to the text to be synthesized is identified;

The speech synthesis system pre-trains the preset type recognition model. The input type and output feature name of the preset type recognition model can be referred to the above table 1 during the training; the speech synthesis system extracts the preset type acoustic feature vector corresponding to the text to be synthesized. And inputting the extracted preset type acoustic feature vector into the trained preset type recognition model, the recognition model identifying the voiceprint feature corresponding to the text to be synthesized.

Step S40: Generate a voice corresponding to the text to be synthesized according to the voiceprint feature corresponding to the text to be synthesized and the pronunciation fundamental frequency of each single word.

After the speech synthesis system obtains the voiceprint feature corresponding to the text to be synthesized, the speech synthesis system can generate the speech corresponding to the synthesized text according to the obtained voiceprint feature and the pronunciation fundamental frequency of each single word, thus completing the text to be synthesized. Speech synthesis.

In the embodiment, the phrase and the sentence in the text to be synthesized are first divided into single words, and the pronunciation fundamental frequency, the pronunciation duration and the voice feature corresponding to each single word are determined; then, according to the voice features and pronunciations of the individual words corresponding to the text to be synthesized The preset type acoustic feature vector corresponding to the text to be synthesized is extracted; and the extracted preset type acoustic feature vector is identified by the trained preset type recognition model, thereby identifying the sound corresponding to the text to be synthesized The pattern is obtained; finally, according to the voiceprint feature corresponding to the text to be synthesized and the pronunciation fundamental frequency of each word, the voice corresponding to the text to be synthesized is generated. Compared with the prior art method of constructing a speech unit by using a conventional hybrid Gaussian technique, the solution of the embodiment identifies the voiceprint feature corresponding to the text to be synthesized by using the trained preset type recognition model, and the preset type recognition model is The data has been trained to be completed by a large amount of data. Therefore, the accuracy of the voiceprint feature corresponding to the text to be synthesized is high, and then, according to the voiceprint feature corresponding to the text to be synthesized and the pronunciation fundamental frequency of each word, the generated The speech corresponding to the synthesized text is better in naturalness and clarity, and is not easy to make mistakes.

Preferably, in this embodiment, the preset type recognition model is a deep feedforward network model (DNN), and the deep feedforward network model is a five-layer neural network, and the neural nodes of each layer The numbers are: 136L-75N-25S-75N-25L, L means using Linear Activation Function, N means using tanh Tangent Activation Function, and S means using sigmoid activation function.

Preferably, as shown in FIG. 2, the training process of the preset type recognition model is as follows:

Step E1: acquiring a preset number of training texts and corresponding training voices;

For example, the preset number is 100,000, that is, 100,000 training texts and training speech corresponding to the 100,000 training texts are obtained. In this embodiment, the training text includes, but is not limited to, a single word, a phrase, a sentence of Mandarin Chinese; for example, the training text may further include English letters, phrases, sentences, and the like.

In step E2, the sentences and phrases in each training text are split into single words, and each single word is split into preset type voice features according to a predetermined pronunciation dictionary, and the voice features of each single word corresponding to each training text are determined;

The speech synthesis system first splits the sentences and phrases in each training text into single words, and then splits each single word into preset type speech features through a predetermined pronunciation dictionary in the speech synthesis system, thereby determining each training text correspondingly. The voice features of the individual words; wherein the preset type of voice features include, for example, syllables, phonemes, initials, and finals.

Step E3, according to the mapping relationship between the predetermined word and the length of the pronunciation, Determining the length of the pronunciation corresponding to each single word, and extracting the preset type acoustic feature vector corresponding to each training text according to the voice features and the length of the pronunciation of each single word corresponding to each training text;

The speech synthesis system has a mapping table between the single word and the length of the pronunciation, according to the mapping table, the pronunciation duration of each single word corresponding to each training text can be queried; after determining the pronunciation duration of each single word corresponding to each training text, The speech synthesis system extracts the preset type acoustic feature vectors corresponding to the respective training texts according to the speech features and the pronunciation duration of each single word corresponding to each training text. For example, the preset type acoustic feature vector is an acoustic and linguistic feature vector, and the preset type acoustic feature vector specifically includes the acoustic and linguistic feature vectors in Table 1 above.

Step E4: processing each training speech by using a preset filter to extract a preset type of voiceprint feature of each training voice, and according to the mapping relationship between the training text and the training voice, the acoustic feature vector of each training text and the corresponding training The voiceprint features of the voice are correlated to obtain associated data of the acoustic feature vector and the voiceprint feature;

In this embodiment, the preset filter is, for example, a Mel filter. The speech synthesis system processes the training speech corresponding to each training text by using the preset filter to extract a preset type of voiceprint feature of each training voice, and then, according to the mapping relationship between the training text and the training voice, each training text is The acoustic feature vector is associated with the voiceprint feature of the corresponding training speech to obtain correlation data between the acoustic feature vector and the voiceprint feature. In this embodiment, the preset type voiceprint feature may be a Mel Frequency Cepstrum Coefficient (MFCC), and all coefficients of the training voice correspond to one feature matrix.

Step E5, the associated data is divided into a first percentage training set and a second percentage verification set, and the sum of the first percentage and the second percentage is less than or equal to 100%;

And dividing a training set and a verification set from the associated data of the voiceprint feature vector and the voiceprint feature, wherein the training set and the verification set respectively occupy a first percentage and a second percentage of the associated data The sum of the first percentage and the second percentage is less than or equal to 100%, that is, the entire associated data may be just divided into the training set and the verification set, or part of the associated data may be Divided into the training set and the validation set; for example, the first percentage is 65% and the second percentage is 30%.

Step E6: training the preset type recognition model by using the correlation data between the acoustic feature vector and the voiceprint feature in the training set, and using the verification set to perform the accuracy of the preset type recognition model of the training after the training is completed. verification;

The system trains the preset type recognition model by using the associated data of the acoustic feature vector and the voiceprint feature in the training set. After the training of the preset type recognition model is completed, the preset type recognition model is passed through the verification set. Verify the accuracy.

In step E7, if the accuracy rate is greater than the preset threshold, the model training ends;

If the verification set verifies the preset type recognition model, the accuracy rate obtained is super After the preset threshold (for example, 98.5%), the training effect of the preset type recognition model has reached the expected standard, and the model training is ended, and the speech synthesis system can apply the preset type recognition model of the training. .

Step E8: If the accuracy rate is less than or equal to the preset threshold, increase the number of the training text and the corresponding training voice, and re-execute the foregoing steps E2, E3, E4, E5, and E6 based on the added training text and the corresponding training voice. .

If the verification of the preset type identification model is verified by the verification set, the obtained accuracy rate is less than or equal to the preset threshold, indicating that the training effect of the preset type identification model has not reached the expected standard, and the number of training sets may not be sufficient or verified. The number of sets is not enough, so in this case, increase the number of training texts and corresponding training voices (for example, increase the fixed number each time or increase the random number each time), and then re-execute the above step E2 based on this. , E3, E4, E5, and E6, are executed in this loop until the requirement of step E7 is reached, and the model training is ended.

In this embodiment, the preset filter is a Mel filter (Mel filter); in the step E4, each training voice is processed by using a preset filter to extract a preset type voiceprint of each training voice. The steps of the feature include:

Pre-emphasis, framing, and windowing of each training speech;

Firstly, each training speech is pre-emphasized, framing and windowing; wherein pre-emphasis is to compensate the high-frequency components of the training speech.

For each windowing, the corresponding spectrum is obtained by Fourier transform;

Then, each window of each training speech is subjected to Fourier transform (ie, FFT transform) to obtain a corresponding spectrum.

The obtained spectrum is obtained by a Mel filter to obtain a Mel spectrum;

The spectrum obtained by the Fourier transform is then passed through a Mel filter, thus obtaining the Mel spectrum.

The cepstrum analysis is performed on the Mel spectrum to obtain the Mel frequency cepstrum coefficient MFCC, which is the voiceprint feature of the speech of the frame.

The cepstrum analysis of this embodiment includes taking logarithm and inverse transform. The actual inverse transform is generally implemented by DCT discrete cosine transform, and the second to thirteenth coefficients after DCT are taken as MFCC coefficients.

The application also proposes a speech synthesis system.

Please refer to FIG. 3 , which is a schematic diagram of an operating environment of a preferred embodiment of the speech synthesis system 10 of the present application.

In the present embodiment, the speech synthesis system 10 is installed and operated in the electronic device 1. The electronic device 1 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a server. The electronic device 1 may include, but is not limited to, a memory 11, a processor 12, and a display 13. Figure 3 shows only the electronic device 1 with components 11-13, but it should be understood that not all illustrated components may be implemented, alternative implementations may be more or less s component.

The memory 11 is a computer storage medium, which in some embodiments may be an internal storage unit of the electronic device 1, such as a hard disk or memory of the electronic device 1. The memory 11 may also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in hard disk equipped on the electronic device 1, a smart memory card (SMC), and a secure digital (SD). Card, flash card, etc. Further, the memory 11 may also include both an internal storage unit of the electronic device 1 and an external storage device. The memory 11 is used to store application software installed in the electronic device 1 and various types of data, such as program codes of the speech synthesis system 10. The memory 11 can also be used to temporarily store data that has been output or is about to be output.

The processor 12, in some embodiments, may be a Central Processing Unit (CPU), microprocessor or other data processing chip for running program code or processing data stored in the memory 11, such as executing a speech synthesis system. 10 and so on.

The display 13 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch sensor, or the like in some embodiments. The display 13 is for displaying information processed in the electronic device 1 and a user interface for displaying visualization, such as a business customization interface or the like. The components 11-13 of the electronic device 1 communicate with one another via a system bus.

Please refer to FIG. 4, which is a program block diagram of a preferred embodiment of the speech synthesis system 10 of the present application. In the present embodiment, the speech synthesis system 10 can be divided into one or more modules, one or more modules being stored in the memory 11 and being executed by one or more processors (the processor 12 in this embodiment). Execute to complete this application. For example, in FIG. 4, the speech synthesis system 10 can be segmented into a determination module 101, an extraction module 102, an identification module 103, and a generation module 104. A module referred to in the present application refers to a series of computer program instruction segments capable of performing a specific function, and is more suitable than the program for describing the execution process of the speech synthesis system 10 in the electronic device 1, wherein:

The determining module 101 is configured to split the sentence and the phrase in the text to be synthesized into a single word after receiving the text to be synthesized for the speech synthesis, according to the predetermined single word, the pronunciation duration, and the pronunciation fundamental frequency. a mapping relationship, determining a pronunciation duration and a pronunciation fundamental frequency corresponding to each single word, and dividing each single word into a preset type of speech feature according to a predetermined pronunciation dictionary, and determining a speech feature of each single word corresponding to the to-be-synthesized text;

In this embodiment, the pronunciation fundamental frequency and the pronunciation duration (ie, the sound length) of the single word may be determined by a pre-trained model, such as by a pre-trained Hidden Markov Model (HMM); the preset type Speech features, for example, may include syllables, phonemes, initials, and finals. After receiving the text to be synthesized for speech synthesis, the speech synthesis system splits the text sentence and the phrase in the text to be synthesized, and splits into a plurality of single words; the system has a predetermined pronunciation dictionary ( For example, the Mandarin pronunciation dictionary, the Cantonese pronunciation dictionary, etc.) as well as predetermined words, pronunciation duration, The mapping table between the three basic frequencies of the pronunciation, the speech synthesis system splits the sentences and phrases in the text to be synthesized into single words, and then finds the pronunciation duration and pronunciation audio corresponding to each single word by searching the mapping table. And subdividing each word into a preset type of voice feature according to the predetermined pronunciation dictionary, thereby obtaining a voice feature of each word corresponding to the text to be synthesized.

The extraction module 102 is configured to extract a preset type acoustic feature vector corresponding to the text to be synthesized according to the voice features and the pronunciation duration of each single word corresponding to the text to be synthesized;

For example, the preset type acoustic feature vector is an acoustic and linguistic feature vector, and the preset type acoustic feature vector includes the acoustic and linguistic feature vectors in Table 2 below, including: factor type, length, pitch , accent position, lip shape, finals | consonant type, pronunciation part, finals | consonant pronunciation, and whether accent, syllable position, position of phoneme in syllable, position of syllable in word.

Table 2 Example of acoustic feature vector

The identification module 103 is configured to input the preset type acoustic feature vector corresponding to the text to be synthesized into the trained preset type recognition model, and identify the voiceprint feature corresponding to the text to be synthesized;

a generating module 104, configured to perform voiceprint features and individual orders according to the text to be synthesized The pronunciation of the word is based on the fundamental frequency, and the speech corresponding to the text to be synthesized is generated.

Specifically, the training process of the preset type recognition model in this embodiment is as follows:

In step E3, according to the mapping relationship between the predetermined word and the length of the pronunciation, the length of the pronunciation corresponding to each word is determined, and the words of each word corresponding to each training text are determined. a sound feature and a length of pronunciation, and extracting a preset type acoustic feature vector corresponding to each training text;

The speech synthesis system has a mapping table between the single word and the length of the pronunciation, according to the mapping table, the pronunciation duration of each single word corresponding to each training text can be queried; after determining the pronunciation duration of each single word corresponding to each training text, The speech synthesis system extracts the preset type acoustic feature vectors corresponding to the respective training texts according to the speech features and the pronunciation duration of each single word corresponding to each training text. For example, the preset type acoustic feature vector is an acoustic and linguistic feature vector, and the preset type acoustic feature vector specifically includes the acoustic and linguistic feature vectors in Table 2 above.

If the verification of the preset type recognition model by the verification set exceeds a preset threshold (for example, 98.5%), the training of the preset type recognition model is indicated. After the training effect has reached the expected standard, the model training is ended, and the speech synthesis system can apply the preset type recognition model of the training.

In this embodiment, the preset filter is a Mel filter (Mel filter); in the above step E4, each training speech is processed by using a preset filter to extract a preset type of voiceprint feature of each training voice. The steps include:

Pre-emphasis, framing, and windowing of each training speech;

The obtained spectrum is obtained by a Mel filter to obtain a Mel spectrum;

The present application also provides a computer readable storage medium storing a speech synthesis system, the speech synthesis system being executable by at least one processor to cause the at least one processor to perform any of the above implementations The speech synthesis method in the example.

The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the invention, and the equivalent structural transformation, or direct/indirect use, of the present invention and the contents of the drawings are used in the inventive concept of the present invention. It is included in the scope of the patent protection of the present invention in other related technical fields.

Claims

An electronic device, comprising: a memory, a processor, wherein the memory stores a speech synthesis system operable on the processor, when the speech synthesis system is executed by the processor Implement the following steps:

After receiving the text to be synthesized for speech synthesis, the sentence and the phrase in the text to be synthesized are split into single words, and each mapping is determined according to a mapping relationship between the predetermined word, the length of the pronunciation, and the fundamental frequency of the pronunciation. The pronunciation duration and the pronunciation fundamental frequency of the single word are divided into preset type speech features according to a predetermined pronunciation dictionary, and the speech features of the individual words corresponding to the text to be synthesized are determined;

Extracting a preset type acoustic feature vector corresponding to the to-be-synthesized text according to a voice feature and a pronunciation duration of each single word corresponding to the text to be synthesized;

Inputting the preset type acoustic feature vector corresponding to the text to be synthesized into the trained preset type recognition model, and identifying the voiceprint feature corresponding to the text to be synthesized;

The voice corresponding to the text to be synthesized is generated according to the voiceprint feature corresponding to the text to be synthesized and the pronunciation fundamental frequency of each single word.
The electronic device according to claim 1, wherein the training process of the preset type recognition model is as follows:

E1: acquiring a preset number of training texts and corresponding training voices;

E2, splitting the sentences and phrases in each training text into single words, and splitting each single word into preset type speech features according to a predetermined pronunciation dictionary, and determining the voice features of each single word corresponding to each training text;

E3. Determine a pronunciation duration corresponding to each single word according to a mapping relationship between the predetermined word and the length of the pronunciation, and extract a preset type acoustic corresponding to each training text according to the voice features and the length of the pronunciation of each single word corresponding to each training text. Feature vector;

E4: processing each training speech by using a preset filter to extract a preset type of voiceprint feature of each training voice, and according to the mapping relationship between the training text and the training voice, the acoustic feature vector of each training text and the corresponding training voice Correlation of the voiceprint features to obtain correlation data between the acoustic feature vector and the voiceprint feature;

E5. The associated data is divided into a first percentage training set and a second percentage verification set, and the sum of the first percentage and the second percentage is less than or equal to 100%;

E6. training the preset type recognition model by using the correlation data between the acoustic feature vector and the voiceprint feature in the training set, and verifying the accuracy of the preset type recognition model of the training by using the verification set after the training is completed. ;

E7. If the accuracy rate is greater than a preset threshold, the model training ends;

E8. If the accuracy rate is less than or equal to the preset threshold, increase the number of the training text and the corresponding training voice, and re-execute the foregoing steps E2, E3, E4, E5, and E6 based on the added training text and the corresponding training voice.
The electronic device according to claim 2, wherein said predetermined filtering The device is a Mel filter, and the step of processing each training speech by using a preset filter to extract a preset type of voiceprint feature of each training voice includes:

Pre-emphasis, framing, and windowing of each training speech;

For each windowing, the corresponding spectrum is obtained by Fourier transform;

The obtained spectrum is obtained by a Mel filter to obtain a Mel spectrum;

The cepstrum analysis is performed on the Mel spectrum to obtain the Mel frequency cepstrum coefficient MFCC, which is the voiceprint feature of the speech of the frame.
The electronic device of claim 3 wherein said cepstrum analysis comprises taking a logarithm and performing an inverse transform.
The electronic device according to claim 1, wherein the preset type recognition model is a deep feedforward network model, and the deep feedforward network model is a five-layer neural network, and the number of neuron nodes in each layer is respectively For: 136L-75N-25S-75N-25L, L means a linear activation function, N means a tangent activation function, and S means a sigmoid activation function.
The electronic device according to claim 5, wherein the training process of the preset type recognition model is as follows:

E1: acquiring a preset number of training texts and corresponding training voices;

E2, splitting the sentences and phrases in each training text into single words, and splitting each single word into preset type speech features according to a predetermined pronunciation dictionary, and determining the voice features of each single word corresponding to each training text;

E3. Determine a pronunciation duration corresponding to each single word according to a mapping relationship between the predetermined word and the length of the pronunciation, and extract a preset type acoustic corresponding to each training text according to the voice features and the length of the pronunciation of each single word corresponding to each training text. Feature vector;

E4: processing each training speech by using a preset filter to extract a preset type of voiceprint feature of each training voice, and according to the mapping relationship between the training text and the training voice, the acoustic feature vector of each training text and the corresponding training voice Correlation of the voiceprint features to obtain correlation data between the acoustic feature vector and the voiceprint feature;

E5. The associated data is divided into a first percentage training set and a second percentage verification set, and the sum of the first percentage and the second percentage is less than or equal to 100%;

E6. training the preset type recognition model by using the correlation data between the acoustic feature vector and the voiceprint feature in the training set, and verifying the accuracy of the preset type recognition model of the training by using the verification set after the training is completed. ;

E7. If the accuracy rate is greater than a preset threshold, the model training ends;

E8. If the accuracy rate is less than or equal to the preset threshold, increase the number of the training text and the corresponding training voice, and re-execute the foregoing steps E2, E3, E4, E5, and E6 based on the added training text and the corresponding training voice.
The electronic device according to claim 6, wherein the preset filter is a Mel filter, and the predetermined training filter is used to process each training voice. The steps of taking out the preset type voiceprint features of each training voice include:

Pre-emphasis, framing, and windowing of each training speech;

For each windowing, the corresponding spectrum is obtained by Fourier transform;

The obtained spectrum is obtained by a Mel filter to obtain a Mel spectrum;

The cepstrum analysis is performed on the Mel spectrum to obtain the Mel frequency cepstrum coefficient MFCC, which is the voiceprint feature of the speech of the frame.
The electronic device of claim 7 wherein said cepstrum analysis comprises taking a logarithm and performing an inverse transform.
An automatic synthesized speech method, characterized in that the method comprises the steps of:

After receiving the text to be synthesized for speech synthesis, the sentence and the phrase in the text to be synthesized are split into single words, and each mapping is determined according to a mapping relationship between the predetermined word, the length of the pronunciation, and the fundamental frequency of the pronunciation. The pronunciation duration and the pronunciation fundamental frequency of the single word are divided into preset type speech features according to a predetermined pronunciation dictionary, and the speech features of the individual words corresponding to the text to be synthesized are determined;

Extracting a preset type acoustic feature vector corresponding to the to-be-synthesized text according to a voice feature and a pronunciation duration of each single word corresponding to the text to be synthesized;

Inputting the preset type acoustic feature vector corresponding to the text to be synthesized into the trained preset type recognition model, and identifying the voiceprint feature corresponding to the text to be synthesized;

The voice corresponding to the text to be synthesized is generated according to the voiceprint feature corresponding to the text to be synthesized and the pronunciation fundamental frequency of each single word.
The speech synthesis method according to claim 9, wherein the training process of the preset type recognition model is as follows:

E1: acquiring a preset number of training texts and corresponding training voices;

E2, splitting the sentences and phrases in each training text into single words, and splitting each single word into preset type speech features according to a predetermined pronunciation dictionary, and determining the voice features of each single word corresponding to each training text;

E3. Determine a pronunciation duration corresponding to each single word according to a mapping relationship between the predetermined word and the length of the pronunciation, and extract a preset type acoustic corresponding to each training text according to the voice features and the length of the pronunciation of each single word corresponding to each training text. Feature vector;

E4: processing each training speech by using a preset filter to extract a preset type of voiceprint feature of each training voice, and according to the mapping relationship between the training text and the training voice, the acoustic feature vector of each training text and the corresponding training voice Correlation of the voiceprint features to obtain correlation data between the acoustic feature vector and the voiceprint feature;

E5. The associated data is divided into a first percentage training set and a second percentage verification set, and the sum of the first percentage and the second percentage is less than or equal to 100%;

E6. training the preset type recognition model by using the correlation data between the acoustic feature vector and the voiceprint feature in the training set, and verifying the accuracy of the preset type recognition model of the training by using the verification set after the training is completed. ;

E7. If the accuracy rate is greater than a preset threshold, the model training ends;

E8. If the accuracy rate is less than or equal to the preset threshold, increase the number of the training text and the corresponding training voice, and re-execute the foregoing steps E2, E3, E4, E5, and E6 based on the added training text and the corresponding training voice.
The speech synthesis method according to claim 10, wherein the preset filter is a Mel filter, and the predetermined training filter is used to process each training speech to extract a preset type of sound of each training speech. The steps of the pattern feature include:

Pre-emphasis, framing, and windowing of each training speech;

For each windowing, the corresponding spectrum is obtained by Fourier transform;

The obtained spectrum is obtained by a Mel filter to obtain a Mel spectrum;

The cepstrum analysis is performed on the Mel spectrum to obtain the Mel frequency cepstrum coefficient MFCC, which is the voiceprint feature of the speech of the frame.
The speech synthesis method according to claim 11, wherein said cepstrum analysis comprises taking a logarithm and performing an inverse transform.
The speech synthesis method according to claim 9, wherein the preset type recognition model is a deep feedforward network model, and the deep feedforward network model is a five-layer neural network, and the number of neuron nodes in each layer They are: 136L-75N-25S-75N-25L, L means a linear activation function, N means a tangent activation function, and S means a sigmoid activation function.
The speech synthesis method according to claim 13, wherein the training process of the preset type recognition model is as follows:

E1: acquiring a preset number of training texts and corresponding training voices;

E2, splitting the sentences and phrases in each training text into single words, and splitting each single word into preset type speech features according to a predetermined pronunciation dictionary, and determining the voice features of each single word corresponding to each training text;

E3. Determine a pronunciation duration corresponding to each single word according to a mapping relationship between the predetermined word and the length of the pronunciation, and extract a preset type acoustic corresponding to each training text according to the voice features and the length of the pronunciation of each single word corresponding to each training text. Feature vector;

E4: processing each training speech by using a preset filter to extract a preset type of voiceprint feature of each training voice, and according to the mapping relationship between the training text and the training voice, the acoustic feature vector of each training text and the corresponding training voice Correlation of the voiceprint features to obtain correlation data between the acoustic feature vector and the voiceprint feature;

E5. The associated data is divided into a first percentage training set and a second percentage verification set, and the sum of the first percentage and the second percentage is less than or equal to 100%;

E6. training the preset type recognition model by using the correlation data between the acoustic feature vector and the voiceprint feature in the training set, and verifying the accuracy of the preset type recognition model of the training by using the verification set after the training is completed. ;

E7. If the accuracy rate is greater than a preset threshold, the model training ends;

E8. If the accuracy rate is less than or equal to the preset threshold, increase the number of the training text and the corresponding training voice, and re-execute the foregoing steps E2, E3, E4, E5, and E6 based on the added training text and the corresponding training voice.
The speech synthesis method according to claim 14, wherein the preset filter is a Mel filter, and the predetermined training filter is used to process each training speech to extract a preset type of sound of each training speech. The steps of the pattern feature include:

Pre-emphasis, framing, and windowing of each training speech;

For each windowing, the corresponding spectrum is obtained by Fourier transform;

The obtained spectrum is obtained by a Mel filter to obtain a Mel spectrum;

The cepstrum analysis is performed on the Mel spectrum to obtain the Mel frequency cepstrum coefficient MFCC, which is the voiceprint feature of the speech of the frame.
The speech synthesis method according to claim 15, wherein said cepstrum analysis comprises taking a logarithm and performing an inverse transform.
A computer readable storage medium, characterized in that the computer readable storage medium stores a speech synthesis system, the speech synthesis system being executable by at least one processor to cause the at least one processor to perform the following steps:

After receiving the text to be synthesized for speech synthesis, the sentence and the phrase in the text to be synthesized are split into single words, and each mapping is determined according to a mapping relationship between the predetermined word, the length of the pronunciation, and the fundamental frequency of the pronunciation. The pronunciation duration and the pronunciation fundamental frequency of the single word are divided into preset type speech features according to a predetermined pronunciation dictionary, and the speech features of the individual words corresponding to the text to be synthesized are determined;

Extracting a preset type acoustic feature vector corresponding to the to-be-synthesized text according to a voice feature and a pronunciation duration of each single word corresponding to the text to be synthesized;

Inputting the preset type acoustic feature vector corresponding to the text to be synthesized into the trained preset type recognition model, and identifying the voiceprint feature corresponding to the text to be synthesized;

The voice corresponding to the text to be synthesized is generated according to the voiceprint feature corresponding to the text to be synthesized and the pronunciation fundamental frequency of each single word.
The computer readable storage medium according to claim 17, wherein the preset type recognition model is a deep feedforward network model, and the deep feedforward network model is a five-layer neural network, and each layer of neurons The number of nodes is: 136L-75N-25S-75N-25L, L means a linear activation function, N means a tangent activation function, and S means a sigmoid activation function.
The computer readable storage medium of claim 18, wherein the training process of the preset type recognition model is as follows:

E1: acquiring a preset number of training texts and corresponding training voices;

E2, splitting the sentences and phrases in each training text into single words, and splitting each single word into preset type speech features according to a predetermined pronunciation dictionary, and determining the voice features of each single word corresponding to each training text;

E3. Determine a pronunciation duration corresponding to each single word according to a mapping relationship between the predetermined word and the length of the pronunciation, and extract a preset type acoustic corresponding to each training text according to the voice features and the length of the pronunciation of each single word corresponding to each training text. Feature vector;

E4: processing each training speech by using a preset filter to extract a preset type of voiceprint feature of each training voice, and according to the mapping relationship between the training text and the training voice, the acoustic feature vector of each training text and the corresponding training voice Correlation of the voiceprint features to obtain correlation data between the acoustic feature vector and the voiceprint feature;

E5. The associated data is divided into a first percentage training set and a second percentage verification set, and the sum of the first percentage and the second percentage is less than or equal to 100%;

E6. training the preset type recognition model by using the correlation data between the acoustic feature vector and the voiceprint feature in the training set, and verifying the accuracy of the preset type recognition model of the training by using the verification set after the training is completed. ;

E7. If the accuracy rate is greater than a preset threshold, the model training ends;

E8. If the accuracy rate is less than or equal to the preset threshold, increase the number of the training text and the corresponding training voice, and re-execute the foregoing steps E2, E3, E4, E5, and E6 based on the added training text and the corresponding training voice.
The computer readable storage medium according to claim 19, wherein the preset filter is a Mel filter, and the preset training filter processes each training speech to extract a preset of each training voice. The steps of the type voiceprint feature include:

Pre-emphasis, framing, and windowing of each training speech;

For each windowing, the corresponding spectrum is obtained by Fourier transform;

The obtained spectrum is obtained by a Mel filter to obtain a Mel spectrum;

The cepstrum analysis is performed on the Mel spectrum to obtain the Mel frequency cepstrum coefficient MFCC, which is the voiceprint feature of the speech of the frame.