WO2019056500A1 - Electronic apparatus, speech synthesis method, and computer readable storage medium - Google Patents

Electronic apparatus, speech synthesis method, and computer readable storage medium

Info

Publication number
WO2019056500A1
WO2019056500A1 (PCT/CN2017/108766)
Authority
WO
WIPO (PCT)
Prior art keywords
training
text
pronunciation
speech
preset type
Prior art date
Application number
PCT/CN2017/108766
Other languages
French (fr)
Chinese (zh)
Inventor
梁浩
程宁
王健宗
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2019056500A1

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L17/00 Speaker identification or verification
    • G10L17/04 Training, enrolment or model building
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Definitions

  • the present application relates to the field of voice technologies, and in particular, to an electronic device, a voice synthesis method, and a computer readable storage medium.
  • Speech synthesis technology, also known as text-to-speech (TTS), aims to let machines turn text information into artificial speech output through recognition and understanding, and is an important branch of modern artificial intelligence.
  • Speech synthesis can play a great role in fields such as quality inspection, machine question answering, and disability assistance, making people's lives more convenient.
  • The naturalness and clarity of synthesized speech directly determine the effectiveness of the technology in practice.
  • Existing speech synthesis schemes usually use traditional Gaussian mixture techniques to construct speech units.
  • However, speech synthesis ultimately requires completing a modeling mapping from morphemes (linguistic space) to phonemes (acoustic space).
  • This is a complex nonlinear pattern mapping; traditional Gaussian mixture techniques cannot achieve high-precision, deep feature mining and expression, and are prone to errors.
  • The present application provides an electronic device, a speech synthesis method, and a computer readable storage medium, which aim to produce speech synthesis results with high precision, naturalness, and clarity.
  • A first aspect of the present application provides an electronic device including a memory and a processor, the memory storing a speech synthesis system executable on the processor; when the speech synthesis system is executed by the processor, the following steps are implemented:
  • after receiving the text to be synthesized for speech synthesis, splitting the sentences and phrases in the text into single words; determining, according to a predetermined mapping relationship between words, pronunciation durations, and pronunciation fundamental frequencies, the pronunciation duration and pronunciation fundamental frequency corresponding to each word; and splitting each word into preset-type speech features according to a predetermined pronunciation dictionary, thereby determining the speech features of each word in the text to be synthesized;
  • A second aspect of the present application provides a speech synthesis method, the method comprising the steps of:
  • after receiving the text to be synthesized for speech synthesis, splitting the sentences and phrases in the text into single words; determining, according to a predetermined mapping relationship between words, pronunciation durations, and pronunciation fundamental frequencies, the pronunciation duration and pronunciation fundamental frequency corresponding to each word; and splitting each word into preset-type speech features according to a predetermined pronunciation dictionary, thereby determining the speech features of each word in the text to be synthesized;
  • generating the voice corresponding to the text to be synthesized according to the voiceprint feature corresponding to the text to be synthesized and the pronunciation fundamental frequency of each word.
  • A third aspect of the present application provides a computer readable storage medium storing a speech synthesis system, the speech synthesis system being executable by at least one processor to cause the at least one processor to perform the following steps:
  • after receiving the text to be synthesized for speech synthesis, splitting the sentences and phrases in the text into single words; determining, according to a predetermined mapping relationship between words, pronunciation durations, and pronunciation fundamental frequencies, the pronunciation duration and pronunciation fundamental frequency corresponding to each word; and splitting each word into preset-type speech features according to a predetermined pronunciation dictionary, thereby determining the speech features of each word in the text to be synthesized;
  • generating the voice corresponding to the text to be synthesized according to the voiceprint feature corresponding to the text to be synthesized and the pronunciation fundamental frequency of each word.
  • The technical solution of the present application first splits the phrases and sentences in the text to be synthesized into single words and determines the pronunciation fundamental frequency, pronunciation duration, and speech features corresponding to each word; it then extracts the preset-type acoustic feature vector corresponding to the text to be synthesized according to the speech features and pronunciation durations of those words, and uses a trained preset-type recognition model to identify, from the extracted acoustic feature vector, the voiceprint feature corresponding to the text; finally, it generates the voice corresponding to the text to be synthesized according to that voiceprint feature and the pronunciation fundamental frequency of each word.
  • Compared with the prior-art approach of constructing speech units with traditional Gaussian mixture techniques, the present application identifies the voiceprint feature corresponding to the text to be synthesized with a trained preset-type recognition model that has been trained in advance on a large amount of data; the recognized voiceprint feature is therefore highly accurate, and the speech generated from that voiceprint feature and the per-word pronunciation fundamental frequencies has better naturalness and clarity and is less prone to errors.
  • FIG. 1 is a schematic flow chart of a preferred embodiment of a speech synthesis method according to the present application.
  • FIG. 2 is a schematic flowchart of a training process of a preset type recognition model in a preferred embodiment of the speech synthesis method of the present application;
  • FIG. 3 is a schematic diagram of an operating environment of a preferred embodiment of a speech synthesis system of the present application
  • FIG. 4 is a block diagram of a program of a preferred embodiment of a speech synthesis system of the present application.
  • FIG. 1 is a schematic flowchart of a voice synthesizing method according to a preferred embodiment of the present application.
  • the voice synthesis method includes:
  • Step S10: after receiving the text to be synthesized for speech synthesis, split the sentences and phrases in the text into single words; determine, according to a predetermined mapping relationship between words, pronunciation durations, and pronunciation fundamental frequencies, the pronunciation duration and pronunciation fundamental frequency corresponding to each word; and split each word into preset-type speech features according to a predetermined pronunciation dictionary, determining the speech features of each word in the text to be synthesized.
  • Pronunciation fundamental frequency: sometimes also called pitch, this refers to the base frequency of the pronunciation. When a sounding body produces sound through vibration, the sound can generally be decomposed into many simple sine waves; in other words, all natural sounds are essentially composed of many sine waves of different frequencies, and the sine wave with the lowest frequency is the fundamental frequency.
  • Phoneme: the smallest speech unit divided according to the natural attributes of speech. Acoustically, a phoneme is the smallest speech unit divided from the perspective of sound quality; physiologically, one articulatory action forms one phoneme. For example, [ma] contains the two articulatory actions [m] and [a] and therefore contains two phonemes. Sounds produced by the same articulatory action belong to the same phoneme, while sounds produced by different articulatory actions are different phonemes: in [ma-mi], the two [m] sounds involve the same articulatory action and are the same phoneme, whereas [a] and [i] involve different articulatory actions and are different phonemes. For example, "Mandarin" (普通话) consists of the three syllables "pu, tong, hua", which can be analyzed into the eight phonemes "p, u, t, o, ng, h, u, a".
  • In this embodiment, the pronunciation fundamental frequency and pronunciation duration (i.e., sound length) of each word may be determined by a pre-trained model, for example a pre-trained Hidden Markov Model (HMM); the preset-type speech features may include, for example, syllables, phonemes, initials, and finals.
  • After receiving the text to be synthesized for speech synthesis, the speech synthesis system splits the sentences and phrases in the text into multiple single words. The system holds a predetermined pronunciation dictionary (for example, a Mandarin pronunciation dictionary or a Cantonese pronunciation dictionary) and a predetermined mapping table between words, pronunciation durations, and pronunciation fundamental frequencies; after splitting the sentences and phrases of the text to be synthesized into single words, the system looks up this mapping table to find the pronunciation duration and pronunciation fundamental frequency corresponding to each word, and further splits each word into preset-type speech features according to the predetermined pronunciation dictionary, thereby obtaining the speech features of each word in the text to be synthesized.
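  • As an illustration only (not part of the original application), the minimal Python sketch below shows one way such a lookup could work: a text is split into single characters, each character's pronunciation duration and fundamental frequency are read from a mapping table, and each character is split into phone-level speech features using a pronunciation dictionary. The table entries, dictionary entries, and default values are hypothetical placeholders.

```python
# Hypothetical sketch of step S10: split text into single words/characters and look up
# pronunciation duration, fundamental frequency, and phone-level speech features.
# The mapping table and pronunciation dictionary contents are illustrative placeholders.

word_mapping = {               # word -> (pronunciation duration in seconds, fundamental frequency in Hz)
    "普": (0.18, 220.0),
    "通": (0.20, 210.0),
    "话": (0.22, 190.0),
}

pronunciation_dict = {         # word -> phone-level speech features (e.g., initial and final)
    "普": ["p", "u"],
    "通": ["t", "o", "ng"],
    "话": ["h", "u", "a"],
}

def analyze_text(text):
    """Return per-character pronunciation duration, fundamental frequency, and speech features."""
    analysis = []
    for char in text:
        duration, f0 = word_mapping.get(char, (0.2, 200.0))   # fall back to assumed defaults
        phones = pronunciation_dict.get(char, [])
        analysis.append({"char": char, "duration": duration, "f0": f0, "phones": phones})
    return analysis

print(analyze_text("普通话"))
```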
  • Step S20: extract the preset-type acoustic feature vector corresponding to the text to be synthesized according to the speech features and pronunciation durations of the words in the text to be synthesized.
  • For example, the preset-type acoustic feature vector is an acoustic and linguistic feature vector that includes the acoustic and linguistic features in Table 1 of the description: phoneme type, sound length, pitch, accent position, mouth shape, final/consonant type, place of articulation, whether the final/consonant is voiced, whether it is accented, syllable position, the position of the phoneme within its syllable, and the position of the syllable within its word.
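  • To make the layout of such a feature vector concrete, the sketch below defines one per-phone feature record with fields named after the list above; the field names, types, encodings, and example values are illustrative assumptions rather than the application's actual table.

```python
# Hypothetical per-phone acoustic/linguistic feature record mirroring the feature list above.
# Field names, encodings, and the example values are illustrative assumptions only.
from dataclasses import dataclass

@dataclass
class PhoneFeatures:
    phone_type: str              # phoneme type (e.g., initial or final)
    duration: float              # sound length in seconds
    pitch: float                 # fundamental frequency in Hz
    accent_position: int         # position of the accent
    mouth_shape: str             # lip/mouth shape category
    place_of_articulation: str   # place of articulation
    is_voiced: bool              # whether the final/consonant is voiced
    is_accented: bool            # whether this phone is accented
    syllable_position: int       # syllable position
    phone_in_syllable: int       # position of the phoneme within its syllable
    syllable_in_word: int        # position of the syllable within its word

example = PhoneFeatures("final", 0.20, 210.0, 1, "rounded", "alveolar", True, False, 2, 2, 1)
```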
  • Step S30: input the preset-type acoustic feature vector corresponding to the text to be synthesized into the trained preset-type recognition model, and identify the voiceprint feature corresponding to the text to be synthesized.
  • the speech synthesis system pre-trains the preset type recognition model.
  • The input and output feature names used when training the preset-type recognition model can be found in Table 1 above. After extracting the preset-type acoustic feature vector corresponding to the text to be synthesized, the speech synthesis system inputs the extracted vector into the trained preset-type recognition model, and the recognition model identifies the voiceprint feature corresponding to the text to be synthesized.
  • Step S40 Generate a voice corresponding to the text to be synthesized according to the voiceprint feature corresponding to the text to be synthesized and the pronunciation fundamental frequency of each single word.
  • After the speech synthesis system obtains the voiceprint feature corresponding to the text to be synthesized, it can generate the speech corresponding to that text according to the obtained voiceprint feature and the pronunciation fundamental frequency of each word, thereby completing speech synthesis for the text to be synthesized.
  • The solution of this embodiment first splits the phrases and sentences in the text to be synthesized into single words and determines the pronunciation fundamental frequency, pronunciation duration, and speech features corresponding to each word; it then extracts the preset-type acoustic feature vector corresponding to the text to be synthesized according to the speech features and pronunciation durations of those words, and uses the trained preset-type recognition model to identify, from the extracted acoustic feature vector, the voiceprint feature corresponding to the text; finally, it generates the voice corresponding to the text to be synthesized according to that voiceprint feature and the pronunciation fundamental frequency of each word.
  • Compared with the prior-art approach of constructing speech units with traditional Gaussian mixture techniques, this embodiment identifies the voiceprint feature corresponding to the text to be synthesized with a trained preset-type recognition model that has been trained in advance on a large amount of data; the recognized voiceprint feature is therefore highly accurate, and the speech generated from that voiceprint feature and the per-word pronunciation fundamental frequencies has better naturalness and clarity and is less prone to errors.
  • Further, in this embodiment, the preset-type recognition model is a deep feedforward network (DNN) model. The deep feedforward network is a five-layer neural network whose layer sizes are 136L-75N-25S-75N-25L, where L denotes a linear activation function, N denotes a tanh activation function, and S denotes a sigmoid activation function.
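  • A minimal sketch of a network with this layout is shown below; only the layer sizes and activations (136 linear inputs, 75 tanh, 25 sigmoid, 75 tanh, 25 linear outputs) come from the description above, while the choice of framework (Keras), optimizer, and loss are illustrative assumptions.

```python
# Sketch of the five-layer feedforward network 136L-75N-25S-75N-25L described above.
# Framework, optimizer, and loss are illustrative assumptions, not part of the application.
import tensorflow as tf

inputs = tf.keras.Input(shape=(136,))                        # 136 linear input features (L)
x = tf.keras.layers.Dense(75, activation="tanh")(inputs)     # N: tanh activation
x = tf.keras.layers.Dense(25, activation="sigmoid")(x)       # S: sigmoid activation
x = tf.keras.layers.Dense(75, activation="tanh")(x)          # N: tanh activation
outputs = tf.keras.layers.Dense(25, activation="linear")(x)  # L: linear output layer
model = tf.keras.Model(inputs, outputs)

model.compile(optimizer="adam", loss="mse")                  # assumed regression setup
model.summary()
```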
  • the training process of the preset type recognition model is as follows:
  • Step E1: acquire a preset number of training texts and the corresponding training speech.
  • For example, the preset number is 100,000; that is, 100,000 training texts and the training speech corresponding to those 100,000 training texts are obtained.
  • The training texts include, but are not limited to, single words, phrases, and sentences in Mandarin Chinese; for example, the training texts may also include English letters, phrases, sentences, and the like.
  • Step E2: split the sentences and phrases in each training text into single words, split each word into preset-type speech features according to a predetermined pronunciation dictionary, and determine the speech features of each word corresponding to each training text.
  • The speech synthesis system first splits the sentences and phrases in each training text into single words, and then splits each word into preset-type speech features using the predetermined pronunciation dictionary in the system, thereby determining the speech features of each word corresponding to each training text.
  • Step E3: determine, according to the predetermined mapping relationship between words and pronunciation durations, the pronunciation duration corresponding to each word, and extract the preset-type acoustic feature vector corresponding to each training text according to the speech features and pronunciation durations of the words in that training text.
  • The speech synthesis system holds a mapping table between words and pronunciation durations; by querying this mapping table, the pronunciation duration of each word in each training text can be obtained. After determining the pronunciation durations, the speech synthesis system extracts the preset-type acoustic feature vector corresponding to each training text according to the speech features and pronunciation durations of the words in that text.
  • the preset type acoustic feature vector is an acoustic and linguistic feature vector, and the preset type acoustic feature vector specifically includes the acoustic and linguistic feature vectors in Table 1 above.
  • Step E4: process each training speech with a preset filter to extract the preset-type voiceprint feature of each training speech, and, according to the mapping relationship between training texts and training speech, associate the acoustic feature vector of each training text with the voiceprint feature of the corresponding training speech to obtain associated data of acoustic feature vectors and voiceprint features.
  • The preset filter is, for example, a Mel filter.
  • The speech synthesis system processes the training speech corresponding to each training text with the preset filter to extract the preset-type voiceprint feature of each training speech; then, according to the mapping relationship between training texts and training speech, the acoustic feature vector of each training text is associated with the voiceprint feature of the corresponding training speech to obtain the associated data of acoustic feature vectors and voiceprint features.
  • The preset-type voiceprint feature may be Mel Frequency Cepstral Coefficients (MFCCs), and all the coefficients of one training speech correspond to one feature matrix.
  • Step E5: divide the associated data into a training set of a first percentage and a verification set of a second percentage, where the sum of the first percentage and the second percentage is less than or equal to 100%.
  • The training set and the verification set respectively occupy the first percentage and the second percentage of the associated data, and the sum of the two percentages is less than or equal to 100%; that is, the associated data may be divided exactly into the training set and the verification set, or only part of the associated data may be divided into them. For example, the first percentage is 65% and the second percentage is 30%.
  • Step E6: train the preset-type recognition model using the associated data of acoustic feature vectors and voiceprint features in the training set, and, after training is completed, verify the accuracy of the trained preset-type recognition model using the verification set.
  • The system trains the preset-type recognition model with the associated data of acoustic feature vectors and voiceprint features in the training set; after training of the preset-type recognition model is completed, its accuracy is verified with the verification set.
  • Step E7: if the accuracy is greater than the preset threshold, model training ends.
  • If the accuracy obtained when the verification set verifies the preset-type recognition model exceeds the preset threshold (for example, 98.5%), the training effect of the preset-type recognition model has reached the expected standard; model training ends, and the speech synthesis system can apply the trained preset-type recognition model.
  • Step E8: if the accuracy is less than or equal to the preset threshold, increase the number of training texts and corresponding training speech, and re-execute steps E2, E3, E4, E5, and E6 based on the enlarged set of training texts and training speech.
  • If the accuracy obtained is less than or equal to the preset threshold, the training effect of the preset-type recognition model has not reached the expected standard, possibly because the training set or the verification set is not large enough. In this case, the number of training texts and corresponding training speech is increased (for example, by a fixed amount each time or by a random amount each time), and steps E2, E3, E4, E5, and E6 are re-executed on this basis; this loop continues until the requirement of step E7 is reached and model training ends.
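  • The skeleton below sketches this split-train-verify loop (steps E5 through E8). The 65%/30% split and the 98.5% threshold are the example values given above; get_associated_data(), train_model(), and evaluate_accuracy() are hypothetical stand-ins so that the loop structure is runnable on its own.

```python
# Sketch of the training loop in steps E5-E8: split the associated data into a training set
# and a verification set, train the model, verify its accuracy, and add data while too low.
import random

ACCURACY_THRESHOLD = 0.985                      # example threshold from the description (98.5%)
TRAIN_FRACTION, VERIFY_FRACTION = 0.65, 0.30    # example split from the description

def get_associated_data(n):
    """Stand-in: return n (acoustic_feature_vector, voiceprint_feature) pairs."""
    return [([random.random()] * 136, [random.random()] * 25) for _ in range(n)]

def train_model(train_set):
    """Stand-in for training the preset-type recognition model on the training set."""
    return {"trained_on": len(train_set)}

def evaluate_accuracy(model, verify_set):
    """Stand-in for verifying the trained model on the verification set."""
    return random.uniform(0.9, 1.0)

num_samples = 1_000             # the description's example is 100,000; kept small in this sketch
while True:
    data = get_associated_data(num_samples)     # steps E1-E4: build the associated data
    random.shuffle(data)
    train_end = int(TRAIN_FRACTION * len(data))
    verify_end = int((TRAIN_FRACTION + VERIFY_FRACTION) * len(data))
    train_set, verify_set = data[:train_end], data[train_end:verify_end]   # step E5
    model = train_model(train_set)                                         # step E6
    accuracy = evaluate_accuracy(model, verify_set)
    if accuracy > ACCURACY_THRESHOLD:           # step E7: accuracy above threshold, stop
        break
    num_samples += 1_000                        # step E8: otherwise add training data and repeat
```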
  • Further, in this embodiment, the preset filter is a Mel filter; in step E4, the step of processing each training speech with the preset filter to extract the preset-type voiceprint feature of each training speech includes:
  • Each training speech is pre-emphasized, framed, and windowed, where pre-emphasis compensates the high-frequency components of the training speech.
  • A Fourier transform (i.e., an FFT) is applied to each window of each training speech to obtain the corresponding spectrum.
  • The spectrum obtained by the Fourier transform is then passed through the Mel filter to obtain the Mel spectrum.
  • Cepstral analysis is performed on the Mel spectrum to obtain the Mel frequency cepstral coefficients (MFCCs), which constitute the voiceprint feature of that frame of speech.
  • The cepstral analysis in this embodiment includes taking the logarithm and applying an inverse transform.
  • In practice, the inverse transform is generally implemented by a discrete cosine transform (DCT), and the 2nd through 13th coefficients after the DCT are taken as the MFCC coefficients.
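  • As a rough, self-contained illustration of this pipeline (pre-emphasis, framing, windowing, FFT, Mel filter bank, logarithm, DCT, keep coefficients 2 through 13), the sketch below uses NumPy and SciPy. The frame length, hop size, filter count, and sample rate are assumptions, and a production system would more likely rely on a dedicated audio library.

```python
# Rough sketch of the MFCC extraction described above. Parameter values (frame length,
# hop size, number of Mel filters, sample rate) are illustrative assumptions.
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular Mel filters spanning 0 Hz to the Nyquist frequency."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc(signal, sample_rate=16000, frame_len=400, hop=160, n_fft=512, n_filters=26):
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])        # pre-emphasis
    window = np.hamming(frame_len)                                        # windowing
    fbank = mel_filterbank(n_filters, n_fft, sample_rate)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)               # framing
    features = []
    for i in range(n_frames):
        frame = signal[i * hop: i * hop + frame_len] * window
        power_spectrum = np.abs(np.fft.rfft(frame, n_fft)) ** 2           # FFT
        mel_spectrum = np.maximum(fbank @ power_spectrum, 1e-10)          # Mel filter bank
        cepstrum = dct(np.log(mel_spectrum), type=2, norm="ortho")        # log + DCT
        features.append(cepstrum[1:13])                                   # 2nd-13th coefficients
    return np.array(features)                                             # one feature matrix per utterance

# Example on a synthetic one-second 220 Hz tone:
coeffs = mfcc(np.sin(2 * np.pi * 220 * np.linspace(0.0, 1.0, 16000)))
print(coeffs.shape)   # (number of frames, 12)
```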
  • the application also proposes a speech synthesis system.
  • FIG. 3 is a schematic diagram of an operating environment of a preferred embodiment of the speech synthesis system 10 of the present application.
  • the speech synthesis system 10 is installed and operated in the electronic device 1.
  • the electronic device 1 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a server.
  • the electronic device 1 may include, but is not limited to, a memory 11, a processor 12, and a display 13.
  • Figure 3 shows only the electronic device 1 with components 11-13, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead.
  • the memory 11 is a computer storage medium, which in some embodiments may be an internal storage unit of the electronic device 1, such as a hard disk or memory of the electronic device 1.
  • In other embodiments, the memory 11 may also be an external storage device of the electronic device 1, such as a plug-in hard disk, a smart memory card (SMC), a secure digital (SD) card, or a flash card equipped on the electronic device 1. Further, the memory 11 may also include both an internal storage unit of the electronic device 1 and an external storage device.
  • the memory 11 is used to store application software installed in the electronic device 1 and various types of data, such as program codes of the speech synthesis system 10.
  • the memory 11 can also be used to temporarily store data that has been output or is about to be output.
  • In some embodiments, the processor 12 may be a Central Processing Unit (CPU), a microprocessor, or another data processing chip, and is used to run the program code stored in the memory 11 or process data, for example to execute the speech synthesis system 10.
  • the display 13 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch sensor, or the like in some embodiments.
  • The display 13 is used to display information processed in the electronic device 1 and to display a visualized user interface, such as a business customization interface.
  • the components 11-13 of the electronic device 1 communicate with one another via a system bus.
  • FIG. 4 is a program block diagram of a preferred embodiment of the speech synthesis system 10 of the present application.
  • The speech synthesis system 10 can be divided into one or more modules, which are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to complete the present application.
  • For example, the speech synthesis system 10 can be divided into a determination module 101, an extraction module 102, an identification module 103, and a generation module 104.
  • A module referred to in the present application is a series of computer program instruction segments capable of performing a specific function, and is more suitable than a program itself for describing the execution process of the speech synthesis system 10 in the electronic device 1, wherein:
  • The determining module 101 is configured to: after receiving the text to be synthesized for speech synthesis, split the sentences and phrases in the text into single words; determine, according to a predetermined mapping relationship between words, pronunciation durations, and pronunciation fundamental frequencies, the pronunciation duration and pronunciation fundamental frequency corresponding to each word; and split each word into preset-type speech features according to a predetermined pronunciation dictionary, determining the speech features of each word in the text to be synthesized.
  • In this embodiment, the pronunciation fundamental frequency and pronunciation duration (i.e., sound length) of each word may be determined by a pre-trained model, for example a pre-trained Hidden Markov Model (HMM); the preset-type speech features may include, for example, syllables, phonemes, initials, and finals.
  • After receiving the text to be synthesized for speech synthesis, the speech synthesis system splits the sentences and phrases in the text into multiple single words. The system holds a predetermined pronunciation dictionary (for example, a Mandarin pronunciation dictionary or a Cantonese pronunciation dictionary) and a predetermined mapping table between words, pronunciation durations, and pronunciation fundamental frequencies; after splitting the sentences and phrases of the text to be synthesized into single words, the system looks up this mapping table to find the pronunciation duration and pronunciation fundamental frequency corresponding to each word, and further splits each word into preset-type speech features according to the predetermined pronunciation dictionary, thereby obtaining the speech features of each word in the text to be synthesized.
  • the extraction module 102 is configured to extract a preset type acoustic feature vector corresponding to the text to be synthesized according to the voice features and the pronunciation duration of each single word corresponding to the text to be synthesized;
  • For example, the preset-type acoustic feature vector is an acoustic and linguistic feature vector that includes the acoustic and linguistic features in Table 2: phoneme type, sound length, pitch, accent position, mouth shape, final/consonant type, place of articulation, whether the final/consonant is voiced, whether it is accented, syllable position, the position of the phoneme within its syllable, and the position of the syllable within its word.
  • the identification module 103 is configured to input the preset type acoustic feature vector corresponding to the text to be synthesized into the trained preset type recognition model, and identify the voiceprint feature corresponding to the text to be synthesized;
  • the speech synthesis system pre-trains the preset type recognition model.
  • The input and output feature names used when training the preset-type recognition model can be found in the table above. After extracting the preset-type acoustic feature vector corresponding to the text to be synthesized, the speech synthesis system inputs the extracted vector into the trained preset-type recognition model, and the recognition model identifies the voiceprint feature corresponding to the text to be synthesized.
  • The generating module 104 is configured to generate the speech corresponding to the text to be synthesized according to the voiceprint feature corresponding to the text to be synthesized and the pronunciation fundamental frequency of each word.
  • After the speech synthesis system obtains the voiceprint feature corresponding to the text to be synthesized, it can generate the speech corresponding to that text according to the obtained voiceprint feature and the pronunciation fundamental frequency of each word, thereby completing speech synthesis for the text to be synthesized.
  • The solution of this embodiment first splits the phrases and sentences in the text to be synthesized into single words and determines the pronunciation fundamental frequency, pronunciation duration, and speech features corresponding to each word; it then extracts the preset-type acoustic feature vector corresponding to the text to be synthesized according to the speech features and pronunciation durations of those words, and uses the trained preset-type recognition model to identify, from the extracted acoustic feature vector, the voiceprint feature corresponding to the text; finally, it generates the voice corresponding to the text to be synthesized according to that voiceprint feature and the pronunciation fundamental frequency of each word.
  • Compared with the prior-art approach of constructing speech units with traditional Gaussian mixture techniques, this embodiment identifies the voiceprint feature corresponding to the text to be synthesized with a trained preset-type recognition model that has been trained in advance on a large amount of data; the recognized voiceprint feature is therefore highly accurate, and the speech generated from that voiceprint feature and the per-word pronunciation fundamental frequencies has better naturalness and clarity and is less prone to errors.
  • Further, in this embodiment, the preset-type recognition model is a deep feedforward network (DNN) model. The deep feedforward network is a five-layer neural network whose layer sizes are 136L-75N-25S-75N-25L, where L denotes a linear activation function, N denotes a tanh activation function, and S denotes a sigmoid activation function.
  • the training process of the preset type recognition model in this embodiment is as follows:
  • Step E1: acquire a preset number of training texts and the corresponding training speech.
  • For example, the preset number is 100,000; that is, 100,000 training texts and the training speech corresponding to those 100,000 training texts are obtained.
  • The training texts include, but are not limited to, single words, phrases, and sentences in Mandarin Chinese; for example, the training texts may also include English letters, phrases, sentences, and the like.
  • Step E2: split the sentences and phrases in each training text into single words, split each word into preset-type speech features according to a predetermined pronunciation dictionary, and determine the speech features of each word corresponding to each training text.
  • The speech synthesis system first splits the sentences and phrases in each training text into single words, and then splits each word into preset-type speech features using the predetermined pronunciation dictionary in the system, thereby determining the speech features of each word corresponding to each training text.
  • Step E3: determine, according to the predetermined mapping relationship between words and pronunciation durations, the pronunciation duration corresponding to each word, and extract the preset-type acoustic feature vector corresponding to each training text according to the speech features and pronunciation durations of the words in that training text.
  • The speech synthesis system holds a mapping table between words and pronunciation durations; by querying this mapping table, the pronunciation duration of each word in each training text can be obtained. After determining the pronunciation durations, the speech synthesis system extracts the preset-type acoustic feature vector corresponding to each training text according to the speech features and pronunciation durations of the words in that text.
  • the preset type acoustic feature vector is an acoustic and linguistic feature vector, and the preset type acoustic feature vector specifically includes the acoustic and linguistic feature vectors in Table 2 above.
  • Step E4: process each training speech with a preset filter to extract the preset-type voiceprint feature of each training speech, and, according to the mapping relationship between training texts and training speech, associate the acoustic feature vector of each training text with the voiceprint feature of the corresponding training speech to obtain associated data of acoustic feature vectors and voiceprint features.
  • The preset filter is, for example, a Mel filter.
  • The speech synthesis system processes the training speech corresponding to each training text with the preset filter to extract the preset-type voiceprint feature of each training speech; then, according to the mapping relationship between training texts and training speech, the acoustic feature vector of each training text is associated with the voiceprint feature of the corresponding training speech to obtain the associated data of acoustic feature vectors and voiceprint features.
  • The preset-type voiceprint feature may be Mel Frequency Cepstral Coefficients (MFCCs), and all the coefficients of one training speech correspond to one feature matrix.
  • Step E5: divide the associated data into a training set of a first percentage and a verification set of a second percentage, where the sum of the first percentage and the second percentage is less than or equal to 100%.
  • The training set and the verification set respectively occupy the first percentage and the second percentage of the associated data, and the sum of the two percentages is less than or equal to 100%; that is, the associated data may be divided exactly into the training set and the verification set, or only part of the associated data may be divided into them. For example, the first percentage is 65% and the second percentage is 30%.
  • Step E6: train the preset-type recognition model using the associated data of acoustic feature vectors and voiceprint features in the training set, and, after training is completed, verify the accuracy of the trained preset-type recognition model using the verification set.
  • The system trains the preset-type recognition model with the associated data of acoustic feature vectors and voiceprint features in the training set; after training of the preset-type recognition model is completed, its accuracy is verified with the verification set.
  • Step E7: if the accuracy is greater than the preset threshold, model training ends.
  • If the accuracy obtained when the verification set verifies the preset-type recognition model exceeds the preset threshold (for example, 98.5%), the training effect of the preset-type recognition model has reached the expected standard; model training ends, and the speech synthesis system can apply the trained preset-type recognition model.
  • Step E8: if the accuracy is less than or equal to the preset threshold, increase the number of training texts and corresponding training speech, and re-execute steps E2, E3, E4, E5, and E6 based on the enlarged set of training texts and training speech.
  • If the accuracy obtained is less than or equal to the preset threshold, the training effect of the preset-type recognition model has not reached the expected standard, possibly because the training set or the verification set is not large enough. In this case, the number of training texts and corresponding training speech is increased (for example, by a fixed amount each time or by a random amount each time), and steps E2, E3, E4, E5, and E6 are re-executed on this basis; this loop continues until the requirement of step E7 is reached and model training ends.
  • Further, in this embodiment, the preset filter is a Mel filter; in step E4 above, the step of processing each training speech with the preset filter to extract the preset-type voiceprint feature of each training speech includes:
  • Each training speech is pre-emphasized, framed, and windowed, where pre-emphasis compensates the high-frequency components of the training speech.
  • A Fourier transform (i.e., an FFT) is applied to each window of each training speech to obtain the corresponding spectrum.
  • The spectrum obtained by the Fourier transform is then passed through the Mel filter to obtain the Mel spectrum.
  • Cepstral analysis is performed on the Mel spectrum to obtain the Mel frequency cepstral coefficients (MFCCs), which constitute the voiceprint feature of that frame of speech.
  • The cepstral analysis in this embodiment includes taking the logarithm and applying an inverse transform.
  • In practice, the inverse transform is generally implemented by a discrete cosine transform (DCT), and the 2nd through 13th coefficients after the DCT are taken as the MFCC coefficients.
  • The present application also provides a computer readable storage medium storing a speech synthesis system, the speech synthesis system being executable by at least one processor to cause the at least one processor to perform the speech synthesis method in any of the above embodiments.

Abstract

Disclosed in the present application are an electronic apparatus, a speech synthesis method, and a storage medium. The method comprises: upon receiving a text to be synthesized, dividing sentences and phrases of the text to be synthesized into single words, determining, according to a predetermined mapping relationship between words, pronunciation durations, and pronunciation fundamental frequencies, the pronunciation duration and pronunciation fundamental frequency corresponding to each of the words, and categorizing, according to a predetermined pronunciation dictionary, the respective words into predetermined speech feature types; extracting, according to the speech feature and pronunciation duration of each word, a predetermined type of acoustic feature vector corresponding to the text to be synthesized; inputting, into a trained predetermined-type identification model, the predetermined type of acoustic feature vector corresponding to the text to be synthesized, and identifying a voiceprint feature of the text to be synthesized; and generating, according to the identified voiceprint feature and the pronunciation fundamental frequencies of the words, speech corresponding to the text to be synthesized. The technical solution of the present application enables highly accurate, natural, and clear speech synthesis results.

Description

Electronic device, speech synthesis method, and computer readable storage medium
The present application claims priority under the Paris Convention to Chinese Patent Application No. CN 201710874876.2, filed on September 25, 2017 and entitled "Electronic Device, Speech Synthesis Method and Computer Readable Storage Medium", the entire content of which is incorporated herein by reference.
Technical Field
The present application relates to the field of voice technologies, and in particular to an electronic device, a speech synthesis method, and a computer readable storage medium.
Background
Speech synthesis technology, also known as text-to-speech (TTS), aims to let machines turn text information into artificial speech output through recognition and understanding, and is an important branch of modern artificial intelligence. Speech synthesis can play a great role in fields such as quality inspection, machine question answering, and disability assistance, making people's lives more convenient, and the naturalness and clarity of synthesized speech directly determine the effectiveness of the technology in practice. At present, existing speech synthesis schemes usually use traditional Gaussian mixture techniques to construct speech units. However, speech synthesis ultimately requires completing a modeling mapping from morphemes (linguistic space) to phonemes (acoustic space), which is a complex nonlinear pattern mapping; traditional Gaussian mixture techniques cannot achieve high-precision, deep feature mining and expression and are prone to errors.
Summary of the Invention
The present application provides an electronic device, a speech synthesis method, and a computer readable storage medium, which aim to produce speech synthesis results with high precision, naturalness, and clarity.
A first aspect of the present application provides an electronic device including a memory and a processor, the memory storing a speech synthesis system executable on the processor; when the speech synthesis system is executed by the processor, the following steps are implemented:
after receiving the text to be synthesized for speech synthesis, splitting the sentences and phrases in the text into single words; determining, according to a predetermined mapping relationship between words, pronunciation durations, and pronunciation fundamental frequencies, the pronunciation duration and pronunciation fundamental frequency corresponding to each word; and splitting each word into preset-type speech features according to a predetermined pronunciation dictionary, thereby determining the speech features of each word in the text to be synthesized;
extracting the preset-type acoustic feature vector corresponding to the text to be synthesized according to the speech features and pronunciation durations of the words in the text to be synthesized;
inputting the preset-type acoustic feature vector corresponding to the text to be synthesized into a trained preset-type recognition model, and identifying the voiceprint feature corresponding to the text to be synthesized;
generating the voice corresponding to the text to be synthesized according to the voiceprint feature corresponding to the text to be synthesized and the pronunciation fundamental frequency of each word.
A second aspect of the present application provides a speech synthesis method, the method comprising the steps of:
after receiving the text to be synthesized for speech synthesis, splitting the sentences and phrases in the text into single words; determining, according to a predetermined mapping relationship between words, pronunciation durations, and pronunciation fundamental frequencies, the pronunciation duration and pronunciation fundamental frequency corresponding to each word; and splitting each word into preset-type speech features according to a predetermined pronunciation dictionary, thereby determining the speech features of each word in the text to be synthesized;
extracting the preset-type acoustic feature vector corresponding to the text to be synthesized according to the speech features and pronunciation durations of the words in the text to be synthesized;
inputting the preset-type acoustic feature vector corresponding to the text to be synthesized into a trained preset-type recognition model, and identifying the voiceprint feature corresponding to the text to be synthesized;
generating the voice corresponding to the text to be synthesized according to the voiceprint feature corresponding to the text to be synthesized and the pronunciation fundamental frequency of each word.
A third aspect of the present application provides a computer readable storage medium storing a speech synthesis system, the speech synthesis system being executable by at least one processor to cause the at least one processor to perform the following steps:
after receiving the text to be synthesized for speech synthesis, splitting the sentences and phrases in the text into single words; determining, according to a predetermined mapping relationship between words, pronunciation durations, and pronunciation fundamental frequencies, the pronunciation duration and pronunciation fundamental frequency corresponding to each word; and splitting each word into preset-type speech features according to a predetermined pronunciation dictionary, thereby determining the speech features of each word in the text to be synthesized;
extracting the preset-type acoustic feature vector corresponding to the text to be synthesized according to the speech features and pronunciation durations of the words in the text to be synthesized;
inputting the preset-type acoustic feature vector corresponding to the text to be synthesized into a trained preset-type recognition model, and identifying the voiceprint feature corresponding to the text to be synthesized;
generating the voice corresponding to the text to be synthesized according to the voiceprint feature corresponding to the text to be synthesized and the pronunciation fundamental frequency of each word.
The technical solution of the present application first splits the phrases and sentences in the text to be synthesized into single words and determines the pronunciation fundamental frequency, pronunciation duration, and speech features corresponding to each word; it then extracts the preset-type acoustic feature vector corresponding to the text to be synthesized according to the speech features and pronunciation durations of those words, and uses a trained preset-type recognition model to identify, from the extracted acoustic feature vector, the voiceprint feature corresponding to the text; finally, it generates the voice corresponding to the text to be synthesized according to that voiceprint feature and the pronunciation fundamental frequency of each word. Compared with the prior-art approach of constructing speech units with traditional Gaussian mixture techniques, the present application identifies the voiceprint feature corresponding to the text to be synthesized with a trained preset-type recognition model that has been trained in advance on a large amount of data; the recognized voiceprint feature is therefore highly accurate, and the speech generated from that voiceprint feature and the per-word pronunciation fundamental frequencies has better naturalness and clarity and is less prone to errors.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a preferred embodiment of the speech synthesis method of the present application;
FIG. 2 is a schematic flowchart of the training process of the preset-type recognition model in a preferred embodiment of the speech synthesis method of the present application;
FIG. 3 is a schematic diagram of the operating environment of a preferred embodiment of the speech synthesis system of the present application;
FIG. 4 is a program module diagram of a preferred embodiment of the speech synthesis system of the present application.
具体实施方式Detailed ways
以下结合附图对本申请的原理和特征进行描述,所举实例只用于解释本申请,并非用于限定本申请的范围。The principles and features of the present application are described in the following with reference to the accompanying drawings, which are only used to explain the present application and are not intended to limit the scope of the application.
如图1所示,图1为本申请语音合成方法较佳实施例的流程示意图。As shown in FIG. 1, FIG. 1 is a schematic flowchart of a voice synthesizing method according to a preferred embodiment of the present application.
本实施例中,该语音合成方法包括:In this embodiment, the voice synthesis method includes:
步骤S10,在收到待进行语音合成的待合成文本后,将该待合成文本中的语句及词组拆分成单字,根据预先确定的单字、发音时长、发音基频三者之间的映射关系,确定各个单字对应的发音时长和发音基频,根据预先确定的发音字典将各个单字拆分成预设类型语音特征,确定出该待合成文本对应的各个单字的语音特征;Step S10, after receiving the text to be synthesized for synthesizing the speech, splitting the sentence and the phrase in the text to be synthesized into a single word, according to a mapping relationship between the predetermined word, the length of the pronunciation, and the fundamental frequency of the pronunciation. Determining a pronunciation duration and a pronunciation fundamental frequency corresponding to each single word, and dividing each single word into a preset type speech feature according to a predetermined pronunciation dictionary, and determining a speech feature of each single word corresponding to the to-be-synthesized text;
发音基频:有时候也可以称为音高,指的是发音的基础频率,当发声体由于振动而发出声音时,发出的声音一般可以分解为许多单纯的正弦波,也就是说所有的自然声音基本都是由许多频率不同的正弦波组成的,其中频率最低的正弦波即为基频。音素:指的是根据语音的自然属性划分出来的最小语音单位,从声学性质来看,音素是从音质角度划分出来的最小语音单位,从生理性质来看,一个发音动作形成一个音素,如〔ma〕包含〔m〕、〔a〕两个发音动作,是两个音素,相同发音动作发出的音就是同一音素,不同发音动作发出的音就是不同音素,如〔ma-mi〕中,两个〔m〕发音动作相同,是相同音素,〔a〕〔i〕发音动作不同,是不同音素。例如“普通话”,由三个音节“pu、tong、hua”组成,可以分析成“p,u,t,o,ng,h,u,a”八个音素。本实施例中,单字的发音基频和发音时长(即音长)可以通过预先训练的模型确定,比如通过预先训练的隐马尔科夫模型(Hidden Markov Model,HMM)确定;所述预设类型语音特征,例如可以包括音节、音素、声母、韵母。语音合成系统在接收到待进行语音合成的待合成文本后,对该待合成文本中的文字语句和词组进行拆分,以拆分成多个单字的形式;系统中具有预先确定的发音字典(例 如,普通话发音字典、粤语发音字典等)以及预先确定的单字、发音时长、发音基频这三者之间的映射表,语音合成系统在将待合成文本中的语句和词组拆分成单字后,再通过查找该映射表就能找出各个单字对应的发音时长和发音音频,以及根据该预先确定的发音字典将各个单字再拆分成预设类型语音特征,从而得到该待合成文本对应的各个单字的语音特征。Pronunciation fundamental frequency: Sometimes it can also be called pitch, which refers to the fundamental frequency of pronunciation. When the sounding body makes a sound due to vibration, the sound emitted can be decomposed into many simple sine waves, that is, all natural. The sound is basically composed of many sine waves with different frequencies, and the lowest frequency sine wave is the fundamental frequency. Phoneme: refers to the smallest phonetic unit based on the natural attributes of speech. From the perspective of acoustic properties, phoneme is the smallest unit of speech divided from the perspective of sound quality. From the physiological point of view, a pronunciation action forms a phoneme, such as [ Ma] contains two sounding actions [m] and [a]. The two phonemes are the same phoneme. The sounds emitted by different pronunciation actions are different phonemes, such as [ma-mi], two [m] The pronunciation is the same, it is the same phoneme, and [a][i] has different pronunciations and is different phonemes. For example, "Mandarin" consists of three syllables "pu, tong, hua", which can be analyzed into "p, u, t, o, ng, h, u, a" eight phonemes. In this embodiment, the pronunciation fundamental frequency and the pronunciation duration (ie, the sound length) of the single word may be determined by a pre-trained model, such as by a pre-trained Hidden Markov Model (HMM); the preset type Speech features, for example, may include syllables, phonemes, initials, and finals. After receiving the text to be synthesized for speech synthesis, the speech synthesis system splits the text sentence and the phrase in the text to be synthesized, and splits into a plurality of single words; the system has a predetermined pronunciation dictionary ( example For example, a Mandarin pronunciation dictionary, a Cantonese pronunciation dictionary, etc., and a mapping table between a predetermined word, a pronunciation duration, and a pronunciation fundamental frequency, the speech synthesis system splits the sentences and phrases in the text to be synthesized into single words. By searching the mapping table, the pronunciation duration and pronunciation audio corresponding to each single word can be found, and each word can be further divided into preset type speech features according to the predetermined pronunciation dictionary, thereby obtaining the corresponding text to be synthesized. The phonetic features of each word.
Step S20: extract the preset-type acoustic feature vector of the text to be synthesized according to the speech features and pronunciation durations of its characters.
For example, the preset-type acoustic feature vector is an acoustic and linguistic feature vector and includes the acoustic and linguistic features listed in Table 1 below, namely: phoneme type, duration, pitch, stress position, mouth shape, final/consonant type, place of articulation, whether the final/consonant is voiced, as well as whether the syllable is stressed, the syllable position, the position of the phoneme within the syllable, and the position of the syllable within the word.
Table 1. Example acoustic feature vector
[Table 1 is provided as an image (PCTCN2017108766-appb-000001) in the original application; it lists the acoustic and linguistic features enumerated above and is not reproduced here.]
Step S30: input the preset-type acoustic feature vector of the text to be synthesized into a trained preset-type recognition model to identify the voiceprint features corresponding to the text to be synthesized.
The speech synthesis system has trained the preset-type recognition model in advance; the names of the input and output features used when training this model are listed in Table 1 above. After extracting the preset-type acoustic feature vector of the text to be synthesized, the system inputs the extracted vector into the trained preset-type recognition model, which identifies the voiceprint features corresponding to the text to be synthesized.
Step S40: generate the speech corresponding to the text to be synthesized according to its voiceprint features and the pronunciation fundamental frequency of each character.
Once the speech synthesis system has obtained the voiceprint features corresponding to the text to be synthesized, it generates the speech corresponding to that text from the obtained voiceprint features and the pronunciation fundamental frequency of each character, which completes the speech synthesis of the text to be synthesized.
In this embodiment, the phrases and sentences in the text to be synthesized are first split into single characters, and the pronunciation fundamental frequency, pronunciation duration and speech features of each character are determined; then the preset-type acoustic feature vector of the text to be synthesized is extracted from the speech features and pronunciation durations of its characters; the trained preset-type recognition model then identifies the voiceprint features corresponding to the text from the extracted acoustic feature vector; finally, the speech corresponding to the text to be synthesized is generated from those voiceprint features and the pronunciation fundamental frequency of each character. Compared with the prior-art approach of constructing speech units with a conventional Gaussian mixture technique, this embodiment identifies the voiceprint features of the text to be synthesized with a trained preset-type recognition model that has been trained in advance on a large amount of data, so the identified voiceprint features are highly accurate; consequently, the speech generated from those voiceprint features and the per-character fundamental frequencies has better naturalness and clarity and is less prone to errors.
Preferably, in this embodiment, the preset-type recognition model is a deep feedforward network model (DNN). The deep feedforward network model is a five-layer neural network whose layers contain 136L-75N-25S-75N-25L neuron nodes respectively, where L denotes a linear activation function, N a hyperbolic tangent (tanh) activation function, and S a sigmoid activation function.
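For illustration, a minimal sketch of such a network is given below. It reads "136L-75N-25S-75N-25L" as a 136-unit linear input layer followed by 75 tanh units, 25 sigmoid units, 75 tanh units and a 25-unit linear output; this reading, and the choice of optimizer and loss, are assumptions made for the example rather than details specified in this application.

```python
# A minimal sketch of the five-layer feedforward network described above,
# under the layer interpretation stated in the lead-in; optimizer and loss are assumed.
import tensorflow as tf

def build_dnn():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(136,)),               # 136-dim acoustic/linguistic feature vector (136L)
        tf.keras.layers.Dense(75, activation="tanh"),      # 75N
        tf.keras.layers.Dense(25, activation="sigmoid"),   # 25S
        tf.keras.layers.Dense(75, activation="tanh"),      # 75N
        tf.keras.layers.Dense(25, activation="linear"),    # 25L: predicted voiceprint features for one frame
    ])
    model.compile(optimizer="adam", loss="mse")            # assumed regression objective
    return model
```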
Preferably, as shown in FIG. 2, the preset-type recognition model is trained as follows.
Step E1: obtain a preset number of training texts and the corresponding training speech.
For example, the preset number is 100,000; that is, 100,000 training texts and the training speech corresponding to those 100,000 training texts are obtained. In this embodiment, the training texts include, but are not limited to, Mandarin Chinese characters, phrases and sentences; for example, the training texts may also include English letters, phrases, sentences, and the like.
Step E2: split the sentences and phrases in each training text into single characters, split each character into preset-type speech features according to the predetermined pronunciation dictionary, and determine the speech features of each character of each training text.
The speech synthesis system first splits all the sentences and phrases in each training text into single characters, and then splits each character into preset-type speech features using the pronunciation dictionary predetermined in the system, thereby determining the speech features of each character of every training text; the preset-type speech features include, for example, syllables, phonemes, initials and finals.
Step E3: determine the pronunciation duration of each character according to a predetermined mapping between characters and pronunciation durations, and extract the preset-type acoustic feature vector of each training text according to the speech features and pronunciation durations of its characters.
The speech synthesis system contains a mapping table between characters and pronunciation durations, from which the pronunciation duration of each character of every training text can be looked up. After these durations are determined, the system extracts the preset-type acoustic feature vector of each training text according to the speech features and pronunciation durations of its characters. For example, the preset-type acoustic feature vector is an acoustic and linguistic feature vector and specifically includes the acoustic and linguistic features in Table 1 above.
Step E4: process each training speech with a preset filter to extract its preset-type voiceprint features, and associate the acoustic feature vector of each training text with the voiceprint features of the corresponding training speech according to the mapping between training texts and training speech, obtaining associated data of acoustic feature vectors and voiceprint features.
In this embodiment, the preset filter is, for example, a Mel filter. The speech synthesis system processes the training speech corresponding to each training text with this filter to extract the preset-type voiceprint features of each training speech, and then, according to the mapping between training texts and training speech, associates the acoustic feature vector of each training text with the voiceprint features of the corresponding training speech, thereby obtaining the associated data of acoustic feature vectors and voiceprint features. In this embodiment, the preset-type voiceprint feature may be the Mel-frequency cepstral coefficients (MFCC), and all the coefficients of a training speech correspond to one feature matrix.
Step E5: divide the associated data into a training set of a first percentage and a validation set of a second percentage, the sum of the first and second percentages being less than or equal to 100%.
A training set and a validation set are separated out of the associated data of acoustic feature vectors and voiceprint features, accounting for a first percentage and a second percentage of the associated data respectively; the sum of the two percentages is less than or equal to 100%, i.e. the associated data may be divided exactly into the training set and the validation set, or only part of the associated data may be used for the two sets. For example, the first percentage is 65% and the second percentage is 30%.
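A minimal sketch of such a split, using the example percentages of 65% and 30%, might look as follows; the shuffling and the unused remainder are implementation choices rather than requirements of the application.

```python
# A minimal sketch of step E5: split the associated data into a 65% training set
# and a 30% validation set (example percentages from the text); the remaining 5%
# is simply left unused here.
import numpy as np

def split_data(features, targets, train_pct=0.65, val_pct=0.30, seed=0):
    n = len(features)
    order = np.random.default_rng(seed).permutation(n)     # shuffle before splitting
    n_train, n_val = int(n * train_pct), int(n * val_pct)
    train_idx, val_idx = order[:n_train], order[n_train:n_train + n_val]
    return (features[train_idx], targets[train_idx]), (features[val_idx], targets[val_idx])
```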
Step E6: train the preset-type recognition model with the associated acoustic-feature-vector and voiceprint-feature data in the training set, and after training is complete, verify the accuracy of the trained preset-type recognition model with the validation set.
The system trains the preset-type recognition model on the associated acoustic-feature-vector and voiceprint-feature data in the training set, and after training is complete, verifies the accuracy of the model on the validation set.
Step E7: if the accuracy is greater than a preset threshold, model training ends.
If the accuracy obtained from the validation set exceeds the preset threshold (for example, 98.5%), the training of the preset-type recognition model has reached the expected standard; model training ends, and the speech synthesis system can put the trained preset-type recognition model into use.
Step E8: if the accuracy is less than or equal to the preset threshold, increase the number of training texts and corresponding training speech, and re-execute steps E2, E3, E4, E5 and E6 on the basis of the enlarged training texts and corresponding training speech.
If the accuracy obtained from the validation set is less than or equal to the preset threshold, the training of the preset-type recognition model has not yet reached the expected standard, possibly because the training set or the validation set is too small. In that case, the number of training texts and corresponding training speech is increased (for example, by a fixed amount or a random amount each time), and steps E2, E3, E4, E5 and E6 are executed again on this basis; this loop continues until the requirement of step E7 is met, whereupon model training ends.
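The control flow of steps E6-E8 can be sketched as a simple loop. The snippet below is a toy illustration only: it reuses the build_dnn sketch above, stands in random vectors for real text/speech data, and measures "accuracy" as the fraction of predictions falling within a tolerance of the target, all of which are assumptions made purely to show the loop structure.

```python
# A toy sketch of the E6-E8 loop: train on the training set, validate, and if the
# accuracy does not exceed the threshold (98.5% in the example above), enlarge the
# data and repeat. Random data and the tolerance-based accuracy are placeholders.
import numpy as np

def accuracy(model, X, y, tol=0.1):
    pred = model.predict(X, verbose=0)
    return float(np.mean(np.all(np.abs(pred - y) < tol, axis=1)))

def train_until_accurate(threshold=0.985, max_rounds=3):
    n, model = 1000, None
    for _ in range(max_rounds):
        X = np.random.rand(n, 136)                        # stand-in acoustic/linguistic feature vectors
        y = np.random.rand(n, 25)                         # stand-in voiceprint feature targets
        n_train, n_val = int(n * 0.65), int(n * 0.30)     # step E5 split
        model = build_dnn()                               # the five-layer DNN sketched earlier
        model.fit(X[:n_train], y[:n_train], epochs=5, verbose=0)                 # step E6: train
        acc = accuracy(model, X[n_train:n_train + n_val], y[n_train:n_train + n_val])  # step E6: validate
        if acc > threshold:                               # step E7: target reached, stop
            return model
        n *= 2                                            # step E8: add more training data and retry
    return model
```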
In this embodiment, the preset filter is preferably a Mel filter. In step E4, processing each training speech with the preset filter to extract its preset-type voiceprint features comprises the following steps.
Pre-emphasize, frame and window each training speech.
Each training speech is first pre-emphasized, framed and windowed; pre-emphasis compensates the high-frequency components of the training speech.
For each windowed frame, obtain the corresponding spectrum by Fourier transform.
Each windowed frame of each training speech is then Fourier-transformed (i.e. an FFT is applied) to obtain the corresponding spectrum.
Pass the obtained spectrum through the Mel filter to obtain the Mel spectrum.
The spectrum obtained by the Fourier transform is passed through the Mel filter, yielding the Mel spectrum.
Perform cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), which are the voiceprint features of that speech frame.
The cepstral analysis in this embodiment consists of taking the logarithm and performing an inverse transform; in practice the inverse transform is usually implemented as a discrete cosine transform (DCT), and the 2nd to 13th coefficients after the DCT are taken as the MFCC coefficients.
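The MFCC pipeline described in these steps can be sketched roughly as follows; the frame length, hop size, FFT size, number of Mel filters and the 0.97 pre-emphasis coefficient are illustrative assumptions (the application does not specify them), and the Mel filterbank is taken from librosa for brevity.

```python
# A rough sketch of the MFCC extraction above: pre-emphasis, framing, windowing,
# FFT, Mel filterbank, then log + DCT, keeping the 2nd-13th coefficients.
import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc_frames(signal, sr, frame_len=400, hop=160, n_fft=512, n_mels=26):
    # Pre-emphasis: compensate the high-frequency components (0.97 is a typical coefficient)
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Framing and Hamming windowing
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)
    # FFT of each windowed frame -> power spectrum
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Mel filterbank -> Mel spectrum
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel_spec = power @ mel_fb.T
    # Cepstral analysis: log, then DCT; keep the 2nd to 13th coefficients as the MFCC
    return dct(np.log(mel_spec + 1e-10), type=2, axis=1, norm="ortho")[:, 1:13]
```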
The present application also proposes a speech synthesis system.
FIG. 3 is a schematic diagram of the operating environment of a preferred embodiment of the speech synthesis system 10 of the present application.
In this embodiment, the speech synthesis system 10 is installed and runs in the electronic apparatus 1. The electronic apparatus 1 may be a computing device such as a desktop computer, a notebook, a palmtop computer or a server, and may include, but is not limited to, a memory 11, a processor 12 and a display 13. FIG. 3 shows only the electronic apparatus 1 with components 11-13, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead.
The memory 11 is a computer storage medium, which in some embodiments may be an internal storage unit of the electronic apparatus 1, such as its hard disk or internal memory. In other embodiments, the memory 11 may be an external storage device of the electronic apparatus 1, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card or a flash card provided on the electronic apparatus 1. Further, the memory 11 may include both an internal storage unit and an external storage device of the electronic apparatus 1. The memory 11 is used to store the application software installed on the electronic apparatus 1 and various kinds of data, such as the program code of the speech synthesis system 10, and may also be used to temporarily store data that has been output or is about to be output.
In some embodiments, the processor 12 may be a central processing unit (CPU), a microprocessor or another data processing chip, used to run the program code stored in the memory 11 or to process data, for example to run the speech synthesis system 10.
In some embodiments, the display 13 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (organic light-emitting diode) touch display, or the like. The display 13 is used to display the information processed in the electronic apparatus 1 and to present a visual user interface, such as a service customization interface. The components 11-13 of the electronic apparatus 1 communicate with one another via a system bus.
FIG. 4 is a program module diagram of a preferred embodiment of the speech synthesis system 10 of the present application. In this embodiment, the speech synthesis system 10 may be divided into one or more modules, which are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to implement the present application. For example, in FIG. 4 the speech synthesis system 10 is divided into a determination module 101, an extraction module 102, a recognition module 103 and a generation module 104. A module in the present application refers to a series of computer program instruction segments capable of performing a specific function, and is more suitable than a program for describing the execution of the speech synthesis system 10 in the electronic apparatus 1, wherein:
The determination module 101 is configured to, after receiving text to be synthesized, split the sentences and phrases in the text to be synthesized into single characters, determine the pronunciation duration and pronunciation fundamental frequency of each character according to a predetermined mapping among characters, pronunciation durations and pronunciation fundamental frequencies, and split each character into preset-type speech features according to a predetermined pronunciation dictionary, thereby determining the speech features of each character of the text to be synthesized.
In this embodiment, the pronunciation fundamental frequency and pronunciation duration (i.e. the length) of a character may be determined by a pre-trained model, for example a pre-trained hidden Markov model (HMM); the preset-type speech features may include, for example, syllables, phonemes, initials and finals. After receiving the text to be synthesized, the speech synthesis system splits the sentences and phrases in that text into single characters. The system contains a predetermined pronunciation dictionary (for example, a Mandarin pronunciation dictionary or a Cantonese pronunciation dictionary) and a predetermined mapping table among characters, pronunciation durations and pronunciation fundamental frequencies; after splitting the text into characters, the system looks up this table to find the pronunciation duration and pronunciation fundamental frequency of each character, and further splits each character into preset-type speech features according to the pronunciation dictionary, thereby obtaining the speech features of each character of the text to be synthesized.
The extraction module 102 is configured to extract the preset-type acoustic feature vector of the text to be synthesized according to the speech features and pronunciation durations of its characters.
For example, the preset-type acoustic feature vector is an acoustic and linguistic feature vector and includes the acoustic and linguistic features listed in Table 2 below, namely: phoneme type, duration, pitch, stress position, mouth shape, final/consonant type, place of articulation, whether the final/consonant is voiced, as well as whether the syllable is stressed, the syllable position, the position of the phoneme within the syllable, and the position of the syllable within the word.
Table 2. Example acoustic feature vector
[Table 2 is provided as an image (PCTCN2017108766-appb-000002) in the original application; it lists the acoustic and linguistic features enumerated above and is not reproduced here.]
The recognition module 103 is configured to input the preset-type acoustic feature vector of the text to be synthesized into the trained preset-type recognition model to identify the voiceprint features corresponding to the text to be synthesized.
The speech synthesis system has trained the preset-type recognition model in advance; the names of the input and output features used when training this model are listed in Table 2 above. After extracting the preset-type acoustic feature vector of the text to be synthesized, the system inputs the extracted vector into the trained preset-type recognition model, which identifies the voiceprint features corresponding to the text to be synthesized.
The generation module 104 is configured to generate the speech corresponding to the text to be synthesized according to its voiceprint features and the pronunciation fundamental frequency of each character.
Once the speech synthesis system has obtained the voiceprint features corresponding to the text to be synthesized, it generates the speech corresponding to that text from the obtained voiceprint features and the pronunciation fundamental frequency of each character, which completes the speech synthesis of the text to be synthesized.
In this embodiment, the phrases and sentences in the text to be synthesized are first split into single characters, and the pronunciation fundamental frequency, pronunciation duration and speech features of each character are determined; then the preset-type acoustic feature vector of the text to be synthesized is extracted from the speech features and pronunciation durations of its characters; the trained preset-type recognition model then identifies the voiceprint features corresponding to the text from the extracted acoustic feature vector; finally, the speech corresponding to the text to be synthesized is generated from those voiceprint features and the pronunciation fundamental frequency of each character. Compared with the prior-art approach of constructing speech units with a conventional Gaussian mixture technique, this embodiment identifies the voiceprint features of the text to be synthesized with a trained preset-type recognition model that has been trained in advance on a large amount of data, so the identified voiceprint features are highly accurate; consequently, the speech generated from those voiceprint features and the per-character fundamental frequencies has better naturalness and clarity and is less prone to errors.
Preferably, in this embodiment, the preset-type recognition model is a deep feedforward network model (DNN). The deep feedforward network model is a five-layer neural network whose layers contain 136L-75N-25S-75N-25L neuron nodes respectively, where L denotes a linear activation function, N a hyperbolic tangent (tanh) activation function, and S a sigmoid activation function.
Specifically, in this embodiment, the preset-type recognition model is trained as follows.
Step E1: obtain a preset number of training texts and the corresponding training speech.
For example, the preset number is 100,000; that is, 100,000 training texts and the training speech corresponding to those 100,000 training texts are obtained. In this embodiment, the training texts include, but are not limited to, Mandarin Chinese characters, phrases and sentences; for example, the training texts may also include English letters, phrases, sentences, and the like.
Step E2: split the sentences and phrases in each training text into single characters, split each character into preset-type speech features according to the predetermined pronunciation dictionary, and determine the speech features of each character of each training text.
The speech synthesis system first splits all the sentences and phrases in each training text into single characters, and then splits each character into preset-type speech features using the pronunciation dictionary predetermined in the system, thereby determining the speech features of each character of every training text; the preset-type speech features include, for example, syllables, phonemes, initials and finals.
Step E3: determine the pronunciation duration of each character according to a predetermined mapping between characters and pronunciation durations, and extract the preset-type acoustic feature vector of each training text according to the speech features and pronunciation durations of its characters.
The speech synthesis system contains a mapping table between characters and pronunciation durations, from which the pronunciation duration of each character of every training text can be looked up. After these durations are determined, the system extracts the preset-type acoustic feature vector of each training text according to the speech features and pronunciation durations of its characters. For example, the preset-type acoustic feature vector is an acoustic and linguistic feature vector and specifically includes the acoustic and linguistic features in Table 2 above.
Step E4: process each training speech with a preset filter to extract its preset-type voiceprint features, and associate the acoustic feature vector of each training text with the voiceprint features of the corresponding training speech according to the mapping between training texts and training speech, obtaining associated data of acoustic feature vectors and voiceprint features.
In this embodiment, the preset filter is, for example, a Mel filter. The speech synthesis system processes the training speech corresponding to each training text with this filter to extract the preset-type voiceprint features of each training speech, and then, according to the mapping between training texts and training speech, associates the acoustic feature vector of each training text with the voiceprint features of the corresponding training speech, thereby obtaining the associated data of acoustic feature vectors and voiceprint features. In this embodiment, the preset-type voiceprint feature may be the Mel-frequency cepstral coefficients (MFCC), and all the coefficients of a training speech correspond to one feature matrix.
Step E5: divide the associated data into a training set of a first percentage and a validation set of a second percentage, the sum of the first and second percentages being less than or equal to 100%.
A training set and a validation set are separated out of the associated data of acoustic feature vectors and voiceprint features, accounting for a first percentage and a second percentage of the associated data respectively; the sum of the two percentages is less than or equal to 100%, i.e. the associated data may be divided exactly into the training set and the validation set, or only part of the associated data may be used for the two sets. For example, the first percentage is 65% and the second percentage is 30%.
Step E6: train the preset-type recognition model with the associated acoustic-feature-vector and voiceprint-feature data in the training set, and after training is complete, verify the accuracy of the trained preset-type recognition model with the validation set.
The system trains the preset-type recognition model on the associated acoustic-feature-vector and voiceprint-feature data in the training set, and after training is complete, verifies the accuracy of the model on the validation set.
Step E7: if the accuracy is greater than a preset threshold, model training ends.
If the accuracy obtained from the validation set exceeds the preset threshold (for example, 98.5%), the training of the preset-type recognition model has reached the expected standard; model training ends, and the speech synthesis system can put the trained preset-type recognition model into use.
Step E8: if the accuracy is less than or equal to the preset threshold, increase the number of training texts and corresponding training speech, and re-execute steps E2, E3, E4, E5 and E6 on the basis of the enlarged training texts and corresponding training speech.
If the accuracy obtained from the validation set is less than or equal to the preset threshold, the training of the preset-type recognition model has not yet reached the expected standard, possibly because the training set or the validation set is too small. In that case, the number of training texts and corresponding training speech is increased (for example, by a fixed amount or a random amount each time), and steps E2, E3, E4, E5 and E6 are executed again on this basis; this loop continues until the requirement of step E7 is met, whereupon model training ends.
In this embodiment, the preset filter is preferably a Mel filter. In step E4 above, processing each training speech with the preset filter to extract its preset-type voiceprint features comprises the following steps.
Pre-emphasize, frame and window each training speech.
Each training speech is first pre-emphasized, framed and windowed; pre-emphasis compensates the high-frequency components of the training speech.
For each windowed frame, obtain the corresponding spectrum by Fourier transform.
Each windowed frame of each training speech is then Fourier-transformed (i.e. an FFT is applied) to obtain the corresponding spectrum.
Pass the obtained spectrum through the Mel filter to obtain the Mel spectrum.
The spectrum obtained by the Fourier transform is passed through the Mel filter, yielding the Mel spectrum.
Perform cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), which are the voiceprint features of that speech frame.
The cepstral analysis in this embodiment consists of taking the logarithm and performing an inverse transform; in practice the inverse transform is usually implemented as a discrete cosine transform (DCT), and the 2nd to 13th coefficients after the DCT are taken as the MFCC coefficients.
The present application also proposes a computer-readable storage medium storing a speech synthesis system, the speech synthesis system being executable by at least one processor to cause the at least one processor to perform the speech synthesis method of any of the above embodiments.
The above description covers only preferred embodiments of the present invention and does not thereby limit its patent scope; any equivalent structural transformation made, under the inventive concept of the present invention, using the contents of this specification and the drawings, and any direct or indirect application in other related technical fields, falls within the scope of patent protection of the present invention.

Claims (20)

1. An electronic apparatus, comprising a memory and a processor, wherein the memory stores a speech synthesis system executable on the processor, and when the speech synthesis system is executed by the processor, the following steps are implemented:
    after receiving text to be synthesized, splitting the sentences and phrases in the text to be synthesized into single characters, determining the pronunciation duration and pronunciation fundamental frequency of each character according to a predetermined mapping among characters, pronunciation durations and pronunciation fundamental frequencies, and splitting each character into preset-type speech features according to a predetermined pronunciation dictionary, thereby determining the speech features of each character of the text to be synthesized;
    extracting the preset-type acoustic feature vector of the text to be synthesized according to the speech features and pronunciation durations of its characters;
    inputting the preset-type acoustic feature vector of the text to be synthesized into a trained preset-type recognition model to identify the voiceprint features corresponding to the text to be synthesized;
    generating the speech corresponding to the text to be synthesized according to its voiceprint features and the pronunciation fundamental frequency of each character.
2. The electronic apparatus of claim 1, wherein the preset-type recognition model is trained as follows:
    E1: obtaining a preset number of training texts and the corresponding training speech;
    E2: splitting the sentences and phrases in each training text into single characters, splitting each character into preset-type speech features according to the predetermined pronunciation dictionary, and determining the speech features of each character of each training text;
    E3: determining the pronunciation duration of each character according to a predetermined mapping between characters and pronunciation durations, and extracting the preset-type acoustic feature vector of each training text according to the speech features and pronunciation durations of its characters;
    E4: processing each training speech with a preset filter to extract its preset-type voiceprint features, and associating the acoustic feature vector of each training text with the voiceprint features of the corresponding training speech according to the mapping between training texts and training speech, to obtain associated data of acoustic feature vectors and voiceprint features;
    E5: dividing the associated data into a training set of a first percentage and a validation set of a second percentage, the sum of the first and second percentages being less than or equal to 100%;
    E6: training the preset-type recognition model with the associated acoustic-feature-vector and voiceprint-feature data in the training set, and after training is complete, verifying the accuracy of the trained preset-type recognition model with the validation set;
    E7: if the accuracy is greater than a preset threshold, ending the model training;
    E8: if the accuracy is less than or equal to the preset threshold, increasing the number of training texts and corresponding training speech, and re-executing steps E2, E3, E4, E5 and E6 on the basis of the enlarged training texts and corresponding training speech.
3. The electronic apparatus of claim 2, wherein the preset filter is a Mel filter, and the step of processing each training speech with the preset filter to extract its preset-type voiceprint features comprises:
    pre-emphasizing, framing and windowing each training speech;
    for each windowed frame, obtaining the corresponding spectrum by Fourier transform;
    passing the obtained spectrum through the Mel filter to obtain the Mel spectrum;
    performing cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), which are the voiceprint features of that speech frame.
4. The electronic apparatus of claim 3, wherein the cepstral analysis comprises taking a logarithm and performing an inverse transform.
5. The electronic apparatus of claim 1, wherein the preset-type recognition model is a deep feedforward network model, the deep feedforward network model being a five-layer neural network whose layers contain 136L-75N-25S-75N-25L neuron nodes respectively, where L denotes a linear activation function, N a hyperbolic tangent (tanh) activation function, and S a sigmoid activation function.
6. The electronic apparatus of claim 5, wherein the preset-type recognition model is trained as follows:
    E1: obtaining a preset number of training texts and the corresponding training speech;
    E2: splitting the sentences and phrases in each training text into single characters, splitting each character into preset-type speech features according to the predetermined pronunciation dictionary, and determining the speech features of each character of each training text;
    E3: determining the pronunciation duration of each character according to a predetermined mapping between characters and pronunciation durations, and extracting the preset-type acoustic feature vector of each training text according to the speech features and pronunciation durations of its characters;
    E4: processing each training speech with a preset filter to extract its preset-type voiceprint features, and associating the acoustic feature vector of each training text with the voiceprint features of the corresponding training speech according to the mapping between training texts and training speech, to obtain associated data of acoustic feature vectors and voiceprint features;
    E5: dividing the associated data into a training set of a first percentage and a validation set of a second percentage, the sum of the first and second percentages being less than or equal to 100%;
    E6: training the preset-type recognition model with the associated acoustic-feature-vector and voiceprint-feature data in the training set, and after training is complete, verifying the accuracy of the trained preset-type recognition model with the validation set;
    E7: if the accuracy is greater than a preset threshold, ending the model training;
    E8: if the accuracy is less than or equal to the preset threshold, increasing the number of training texts and corresponding training speech, and re-executing steps E2, E3, E4, E5 and E6 on the basis of the enlarged training texts and corresponding training speech.
7. The electronic apparatus of claim 6, wherein the preset filter is a Mel filter, and the step of processing each training speech with the preset filter to extract its preset-type voiceprint features comprises:
    pre-emphasizing, framing and windowing each training speech;
    for each windowed frame, obtaining the corresponding spectrum by Fourier transform;
    passing the obtained spectrum through the Mel filter to obtain the Mel spectrum;
    performing cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), which are the voiceprint features of that speech frame.
8. The electronic apparatus of claim 7, wherein the cepstral analysis comprises taking a logarithm and performing an inverse transform.
9. An automatic speech synthesis method, comprising the steps of:
    after receiving text to be synthesized, splitting the sentences and phrases in the text to be synthesized into single characters, determining the pronunciation duration and pronunciation fundamental frequency of each character according to a predetermined mapping among characters, pronunciation durations and pronunciation fundamental frequencies, and splitting each character into preset-type speech features according to a predetermined pronunciation dictionary, thereby determining the speech features of each character of the text to be synthesized;
    extracting the preset-type acoustic feature vector of the text to be synthesized according to the speech features and pronunciation durations of its characters;
    inputting the preset-type acoustic feature vector of the text to be synthesized into a trained preset-type recognition model to identify the voiceprint features corresponding to the text to be synthesized;
    generating the speech corresponding to the text to be synthesized according to its voiceprint features and the pronunciation fundamental frequency of each character.
10. The speech synthesis method of claim 9, wherein the preset-type recognition model is trained as follows:
    E1: obtaining a preset number of training texts and the corresponding training speech;
    E2: splitting the sentences and phrases in each training text into single characters, splitting each character into preset-type speech features according to the predetermined pronunciation dictionary, and determining the speech features of each character of each training text;
    E3: determining the pronunciation duration of each character according to a predetermined mapping between characters and pronunciation durations, and extracting the preset-type acoustic feature vector of each training text according to the speech features and pronunciation durations of its characters;
    E4: processing each training speech with a preset filter to extract its preset-type voiceprint features, and associating the acoustic feature vector of each training text with the voiceprint features of the corresponding training speech according to the mapping between training texts and training speech, to obtain associated data of acoustic feature vectors and voiceprint features;
    E5: dividing the associated data into a training set of a first percentage and a validation set of a second percentage, the sum of the first and second percentages being less than or equal to 100%;
    E6: training the preset-type recognition model with the associated acoustic-feature-vector and voiceprint-feature data in the training set, and after training is complete, verifying the accuracy of the trained preset-type recognition model with the validation set;
    E7: if the accuracy is greater than a preset threshold, ending the model training;
    E8: if the accuracy is less than or equal to the preset threshold, increasing the number of training texts and corresponding training speech, and re-executing steps E2, E3, E4, E5 and E6 on the basis of the enlarged training texts and corresponding training speech.
11. The speech synthesis method of claim 10, wherein the preset filter is a Mel filter, and the step of processing each training speech with the preset filter to extract its preset-type voiceprint features comprises:
    pre-emphasizing, framing and windowing each training speech;
    for each windowed frame, obtaining the corresponding spectrum by Fourier transform;
    passing the obtained spectrum through the Mel filter to obtain the Mel spectrum;
    performing cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), which are the voiceprint features of that speech frame.
12. The speech synthesis method of claim 11, wherein the cepstral analysis comprises taking a logarithm and performing an inverse transform.
13. The speech synthesis method of claim 9, wherein the preset-type recognition model is a deep feedforward network model, the deep feedforward network model being a five-layer neural network whose layers contain 136L-75N-25S-75N-25L neuron nodes respectively, where L denotes a linear activation function, N a hyperbolic tangent (tanh) activation function, and S a sigmoid activation function.
  14. The speech synthesis method according to claim 13, wherein the training process of the preset type of recognition model is as follows:
    E1. acquiring a preset number of training texts and corresponding training speech;
    E2. splitting the sentences and phrases in each training text into single words, splitting each single word into preset type of phonetic features according to a predetermined pronunciation dictionary, and determining the phonetic features of the single words corresponding to each training text;
    E3. determining the pronunciation duration corresponding to each single word according to a predetermined mapping relationship between single words and pronunciation durations, and extracting the preset type of acoustic feature vector corresponding to each training text according to the phonetic features and pronunciation durations of the single words corresponding to each training text;
    E4. processing each training speech with a preset filter to extract the preset type of voiceprint feature of each training speech, and associating the acoustic feature vector of each training text with the voiceprint feature of the corresponding training speech according to the mapping relationship between training texts and training speech, to obtain association data between acoustic feature vectors and voiceprint features;
    E5. dividing the association data into a training set of a first percentage and a validation set of a second percentage, the sum of the first percentage and the second percentage being less than or equal to 100%;
    E6. training the preset type of recognition model with the association data between acoustic feature vectors and voiceprint features in the training set, and verifying the accuracy of the trained preset type of recognition model with the validation set after the training is completed;
    E7. if the accuracy is greater than a preset threshold, ending the model training;
    E8. if the accuracy is less than or equal to the preset threshold, increasing the number of training texts and corresponding training speech, and re-executing the foregoing steps E2, E3, E4, E5 and E6 based on the increased training texts and corresponding training speech.
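A minimal sketch of the E1-E8 loop of claim 14 (recited again in claim 19) follows, reusing the `PresetTypeRecognitionModel` class and the `mfcc` function sketched above. The data loader `load_corpus` is a hypothetical helper assumed to cover steps E1-E4 by returning paired acoustic feature vectors and voiceprint features; the epoch count, the mean-squared-error objective, the accuracy proxy and the 0.8 threshold are likewise illustrative assumptions. The claim itself fixes only the split into training and validation sets, the threshold test, and the enlarge-and-retrain fallback.

```python
import torch
import torch.nn as nn

def train_preset_type_model(load_corpus, n_samples=1000,
                            train_pct=0.7, val_pct=0.2, threshold=0.8):
    """E1-E8 loop. `load_corpus(n)` is a hypothetical helper returning the
    association data of step E4 as arrays of shape (n, 136) and (n, 25)."""
    while True:
        X, Y = load_corpus(n_samples)                        # E1-E4: data preparation
        X = torch.as_tensor(X, dtype=torch.float32)
        Y = torch.as_tensor(Y, dtype=torch.float32)
        # E5: first-percentage training set, second-percentage validation set
        n_tr, n_va = int(len(X) * train_pct), int(len(X) * val_pct)
        X_tr, Y_tr = X[:n_tr], Y[:n_tr]
        X_va, Y_va = X[n_tr:n_tr + n_va], Y[n_tr:n_tr + n_va]
        # E6: train on the training set ...
        model = PresetTypeRecognitionModel()                 # class sketched under claim 13
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        loss_fn = nn.MSELoss()
        for _ in range(200):                                 # illustrative epoch count
            opt.zero_grad()
            loss = loss_fn(model(X_tr), Y_tr)
            loss.backward()
            opt.step()
        # ... then verify its accuracy on the validation set
        with torch.no_grad():
            val_err = loss_fn(model(X_va), Y_va).item()
        accuracy = 1.0 / (1.0 + val_err)                     # illustrative accuracy proxy
        if accuracy > threshold:                             # E7: above threshold, stop
            return model
        n_samples *= 2                                       # E8: more data, repeat E2-E6
```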
  15. The speech synthesis method according to claim 14, wherein the preset filter is a Mel filter, and the step of processing each training speech with the preset filter to extract the preset type of voiceprint feature of each training speech comprises:
    performing pre-emphasis, framing and windowing on each training speech;
    obtaining, for each windowed frame, the corresponding spectrum by Fourier transform;
    passing the obtained spectrum through the Mel filter to obtain the Mel spectrum;
    performing cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), the MFCC being the voiceprint feature of that frame of speech.
  16. The speech synthesis method according to claim 15, wherein the cepstral analysis comprises taking the logarithm and performing an inverse transform.
  17. A computer readable storage medium, wherein the computer readable storage medium stores a speech synthesis system, the speech synthesis system being executable by at least one processor to cause the at least one processor to perform the following steps:
    after receiving a text to be synthesized, splitting the sentences and phrases in the text to be synthesized into single words, determining the pronunciation duration and pronunciation fundamental frequency corresponding to each single word according to a predetermined mapping relationship among single words, pronunciation durations and pronunciation fundamental frequencies, splitting each single word into preset type of phonetic features according to a predetermined pronunciation dictionary, and determining the phonetic features of the single words corresponding to the text to be synthesized;
    extracting the preset type of acoustic feature vector corresponding to the text to be synthesized according to the phonetic features and pronunciation durations of the single words corresponding to the text to be synthesized;
    inputting the preset type of acoustic feature vector corresponding to the text to be synthesized into the trained preset type of recognition model to identify the voiceprint feature corresponding to the text to be synthesized;
    generating the speech corresponding to the text to be synthesized according to the voiceprint feature corresponding to the text to be synthesized and the pronunciation fundamental frequency of each single word.
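Claim 17 recites the runtime pipeline: the text is split into single words, each word is mapped to phonetic features, a pronunciation duration and a fundamental frequency, an acoustic feature vector is built and passed through the trained model, and the resulting voiceprint feature together with the per-word fundamental frequencies drives waveform generation. The sketch below is an editor's illustration of that flow: the dictionary contents, the 136-dimensional feature layout and the sinusoid-based waveform stage are assumptions (the claims do not fix a particular vocoder), and `PresetTypeRecognitionModel` refers to the class sketched under claim 13.

```python
import numpy as np
import torch

# Hypothetical lookup tables standing in for the predetermined pronunciation
# dictionary and the word -> (duration, fundamental frequency) mapping.
PRONUNCIATION_DICT = {"你": [0.1] * 34, "好": [0.2] * 34}   # per-word phonetic features
DURATION_F0 = {"你": (0.18, 210.0), "好": (0.22, 190.0)}     # seconds, Hz

def synthesize(text, model, sr=16000):
    waveform = []
    for word in text:
        phonetic = PRONUNCIATION_DICT[word]
        duration, f0 = DURATION_F0[word]
        # Acoustic feature vector for this word (136-dim, layout assumed):
        # phonetic features in the leading slots, duration in the last slot.
        vec = np.zeros(136, dtype=np.float32)
        vec[:len(phonetic)] = phonetic
        vec[-1] = duration
        # The trained model maps the acoustic feature vector to a voiceprint feature.
        with torch.no_grad():
            voiceprint = model(torch.from_numpy(vec)).numpy()   # 25-dim
        # Generate this word's audio from the voiceprint and its F0.
        # A bare sinusoid at f0, scaled by the mean voiceprint magnitude,
        # stands in here for a real vocoder.
        t = np.arange(int(duration * sr)) / sr
        waveform.append(float(np.abs(voiceprint).mean()) * np.sin(2 * np.pi * f0 * t))
    return np.concatenate(waveform)
```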
  18. The computer readable storage medium according to claim 17, wherein the preset type of recognition model is a deep feedforward network model, the deep feedforward network model being a five-layer neural network whose numbers of neuron nodes per layer are 136L-75N-25S-75N-25L, where L denotes a linear activation function, N denotes a tangent (tanh) activation function, and S denotes a sigmoid activation function.
  19. The computer readable storage medium according to claim 18, wherein the training process of the preset type of recognition model is as follows:
    E1. acquiring a preset number of training texts and corresponding training speech;
    E2. splitting the sentences and phrases in each training text into single words, splitting each single word into preset type of phonetic features according to a predetermined pronunciation dictionary, and determining the phonetic features of the single words corresponding to each training text;
    E3. determining the pronunciation duration corresponding to each single word according to a predetermined mapping relationship between single words and pronunciation durations, and extracting the preset type of acoustic feature vector corresponding to each training text according to the phonetic features and pronunciation durations of the single words corresponding to each training text;
    E4. processing each training speech with a preset filter to extract the preset type of voiceprint feature of each training speech, and associating the acoustic feature vector of each training text with the voiceprint feature of the corresponding training speech according to the mapping relationship between training texts and training speech, to obtain association data between acoustic feature vectors and voiceprint features;
    E5. dividing the association data into a training set of a first percentage and a validation set of a second percentage, the sum of the first percentage and the second percentage being less than or equal to 100%;
    E6. training the preset type of recognition model with the association data between acoustic feature vectors and voiceprint features in the training set, and verifying the accuracy of the trained preset type of recognition model with the validation set after the training is completed;
    E7. if the accuracy is greater than a preset threshold, ending the model training;
    E8. if the accuracy is less than or equal to the preset threshold, increasing the number of training texts and corresponding training speech, and re-executing the foregoing steps E2, E3, E4, E5 and E6 based on the increased training texts and corresponding training speech.
  20. The computer readable storage medium according to claim 19, wherein the preset filter is a Mel filter, and the step of processing each training speech with the preset filter to extract the preset type of voiceprint feature of each training speech comprises:
    performing pre-emphasis, framing and windowing on each training speech;
    obtaining, for each windowed frame, the corresponding spectrum by Fourier transform;
    passing the obtained spectrum through the Mel filter to obtain the Mel spectrum;
    performing cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), the MFCC being the voiceprint feature of that frame of speech.
PCT/CN2017/108766 2017-09-25 2017-10-31 Electronic apparatus, speech synthesis method, and computer readable storage medium WO2019056500A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710874876.2 2017-09-25
CN201710874876.2A CN107564511B (en) 2017-09-25 2017-09-25 Electronic device, phoneme synthesizing method and computer readable storage medium

Publications (1)

Publication Number Publication Date
WO2019056500A1 true WO2019056500A1 (en) 2019-03-28

Family

ID=60982768

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/108766 WO2019056500A1 (en) 2017-09-25 2017-10-31 Electronic apparatus, speech synthesis method, and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN107564511B (en)
WO (1) WO2019056500A1 (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108630190B (en) * 2018-05-18 2019-12-10 百度在线网络技术(北京)有限公司 Method and apparatus for generating speech synthesis model
CN109346056B (en) * 2018-09-20 2021-06-11 中国科学院自动化研究所 Speech synthesis method and device based on depth measurement network
CN109584859A (en) * 2018-11-07 2019-04-05 上海指旺信息科技有限公司 Phoneme synthesizing method and device
CN109754778B (en) * 2019-01-17 2023-05-30 平安科技(深圳)有限公司 Text speech synthesis method and device and computer equipment
CN110164413B (en) * 2019-05-13 2021-06-04 北京百度网讯科技有限公司 Speech synthesis method, apparatus, computer device and storage medium
CN110767210A (en) * 2019-10-30 2020-02-07 四川长虹电器股份有限公司 Method and device for generating personalized voice
CN111161705B (en) * 2019-12-19 2022-11-18 寒武纪(西安)集成电路有限公司 Voice conversion method and device
CN111091807B (en) * 2019-12-26 2023-05-26 广州酷狗计算机科技有限公司 Speech synthesis method, device, computer equipment and storage medium
CN111508469A (en) * 2020-04-26 2020-08-07 北京声智科技有限公司 Text-to-speech conversion method and device
CN111429923B (en) * 2020-06-15 2020-09-29 深圳市友杰智新科技有限公司 Training method and device of speaker information extraction model and computer equipment
CN111968616A (en) * 2020-08-19 2020-11-20 浙江同花顺智能科技有限公司 Training method and device of speech synthesis model, electronic equipment and storage medium
CN112184859B (en) * 2020-09-01 2023-10-03 魔珐(上海)信息科技有限公司 End-to-end virtual object animation generation method and device, storage medium and terminal
CN112184858B (en) 2020-09-01 2021-12-07 魔珐(上海)信息科技有限公司 Virtual object animation generation method and device based on text, storage medium and terminal
CN112257407A (en) * 2020-10-20 2021-01-22 网易(杭州)网络有限公司 Method and device for aligning text in audio, electronic equipment and readable storage medium
CN113838450B (en) * 2021-08-11 2022-11-25 北京百度网讯科技有限公司 Audio synthesis and corresponding model training method, device, equipment and storage medium
CN117765926A (en) * 2024-02-19 2024-03-26 上海蜜度科技股份有限公司 Speech synthesis method, system, electronic equipment and medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101000765B (en) * 2007-01-09 2011-03-30 黑龙江大学 Speech synthetic method based on rhythm character
JP5025550B2 (en) * 2008-04-01 2012-09-12 株式会社東芝 Audio processing apparatus, audio processing method, and program
CN104538024B (en) * 2014-12-01 2019-03-08 百度在线网络技术(北京)有限公司 Phoneme synthesizing method, device and equipment

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050055207A1 (en) * 2000-03-31 2005-03-10 Canon Kabushiki Kaisha Speech information processing method and apparatus and storage medium using a segment pitch pattern model
CN101710488A (en) * 2009-11-20 2010-05-19 安徽科大讯飞信息科技股份有限公司 Method and device for voice synthesis
CN101894547A (en) * 2010-06-30 2010-11-24 北京捷通华声语音技术有限公司 Speech synthesis method and system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210390943A1 (en) * 2020-06-15 2021-12-16 Beijing Baidu Netcom Science And Technology Co., Ltd. Method And Apparatus For Training Model, Method And Apparatus For Synthesizing Speech, Device And Storage Medium
US11769480B2 (en) * 2020-06-15 2023-09-26 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for training model, method and apparatus for synthesizing speech, device and storage medium

Also Published As

Publication number Publication date
CN107564511A (en) 2018-01-09
CN107564511B (en) 2018-09-11

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application
    Ref document number: 17926126; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
32PN Ep: public notification in the EP bulletin as address of the addressee cannot be established
    Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 24.09.2020)
122 Ep: PCT application non-entry in European phase
    Ref document number: 17926126; Country of ref document: EP; Kind code of ref document: A1