WO2019056500A1 - Electronic apparatus, speech synthesis method, and computer-readable storage medium - Google Patents

Electronic apparatus, speech synthesis method, and computer-readable storage medium

Info

Publication number
WO2019056500A1
Authority
WO
WIPO (PCT)
Prior art keywords
training
text
pronunciation
speech
preset type
Prior art date
Application number
PCT/CN2017/108766
Other languages
English (en)
Chinese (zh)
Inventor
梁浩
程宁
王健宗
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2019056500A1

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification
    • G10L17/04Training, enrolment or model building
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Definitions

  • the present application relates to the field of voice technologies, and in particular, to an electronic device, a voice synthesis method, and a computer readable storage medium.
  • Speech synthesis technology, also known as text-to-speech (TTS), aims to convert text information into artificial speech output through recognition and understanding; it is an important branch of modern artificial intelligence.
  • Speech synthesis plays a significant role in quality inspection, machine question answering, disability assistance, and other fields, making people's lives more convenient.
  • The naturalness and clarity of synthesized speech directly determine how effective the technology is in application.
  • Existing speech synthesis schemes usually use traditional Gaussian mixture techniques to construct speech units.
  • Speech synthesis essentially performs a modeling mapping from morphemes (linguistic space) to phonemes (acoustic space).
  • This is a complex nonlinear mapping; traditional Gaussian mixture techniques cannot achieve high-precision, deep feature mining and expression, and are therefore prone to errors.
  • The present application provides an electronic device, a speech synthesis method, and a computer-readable storage medium, which are intended to produce speech synthesis results with high precision, naturalness, and clarity.
  • A first aspect of the present application provides an electronic device, including a memory, a processor, and a speech synthesis system stored in the memory and executable on the processor, where the speech synthesis system, when executed by the processor, implements the following steps:
  • after receiving a text to be synthesized for speech synthesis, splitting the sentences and phrases in the text to be synthesized into single words, determining the pronunciation duration and pronunciation fundamental frequency corresponding to each single word according to a predetermined mapping relationship between single words, pronunciation durations, and pronunciation fundamental frequencies, and dividing each single word into preset-type speech features according to a predetermined pronunciation dictionary, thereby determining the speech features of each single word corresponding to the text to be synthesized;
  • extracting the preset-type acoustic feature vector corresponding to the text to be synthesized according to the speech features and pronunciation durations of the single words, inputting the extracted vector into a trained preset-type recognition model to identify the voiceprint feature corresponding to the text to be synthesized, and generating the speech corresponding to the text to be synthesized according to the identified voiceprint feature and the pronunciation fundamental frequency of each single word.
  • A second aspect of the present application provides a speech synthesis method, the method comprising the steps of:
  • after receiving a text to be synthesized for speech synthesis, splitting the sentences and phrases in the text to be synthesized into single words, determining the pronunciation duration and pronunciation fundamental frequency corresponding to each single word according to a predetermined mapping relationship between single words, pronunciation durations, and pronunciation fundamental frequencies, and dividing each single word into preset-type speech features according to a predetermined pronunciation dictionary, thereby determining the speech features of each single word corresponding to the text to be synthesized;
  • extracting the preset-type acoustic feature vector corresponding to the text to be synthesized according to the speech features and pronunciation durations of the single words, and inputting the extracted vector into a trained preset-type recognition model to identify the voiceprint feature corresponding to the text to be synthesized;
  • generating the speech corresponding to the text to be synthesized according to the voiceprint feature corresponding to the text to be synthesized and the pronunciation fundamental frequency of each single word.
  • A third aspect of the present application provides a computer-readable storage medium storing a speech synthesis system, the speech synthesis system being executable by at least one processor to cause the at least one processor to perform the following steps:
  • after receiving a text to be synthesized for speech synthesis, splitting the sentences and phrases in the text to be synthesized into single words, determining the pronunciation duration and pronunciation fundamental frequency corresponding to each single word according to a predetermined mapping relationship between single words, pronunciation durations, and pronunciation fundamental frequencies, and dividing each single word into preset-type speech features according to a predetermined pronunciation dictionary, thereby determining the speech features of each single word corresponding to the text to be synthesized;
  • extracting the preset-type acoustic feature vector corresponding to the text to be synthesized according to the speech features and pronunciation durations of the single words, and inputting the extracted vector into a trained preset-type recognition model to identify the voiceprint feature corresponding to the text to be synthesized;
  • generating the speech corresponding to the text to be synthesized according to the voiceprint feature corresponding to the text to be synthesized and the pronunciation fundamental frequency of each single word.
  • The technical solution of the present application first splits the phrases and sentences in the text to be synthesized into single words and determines the pronunciation fundamental frequency, pronunciation duration, and speech features corresponding to each single word; then, according to the speech features and pronunciation durations of the single words corresponding to the text to be synthesized, the preset-type acoustic feature vector corresponding to the text to be synthesized is extracted; the extracted preset-type acoustic feature vector is recognized by the trained preset-type recognition model, thereby identifying the voiceprint feature corresponding to the text to be synthesized; finally, the speech corresponding to the text to be synthesized is generated according to the voiceprint feature corresponding to the text to be synthesized and the pronunciation fundamental frequency of each single word.
  • Because the voiceprint feature corresponding to the text to be synthesized is identified with a preset-type recognition model that has been trained in advance on a large amount of data, the accuracy of the identified voiceprint feature is high; consequently, the speech generated from that voiceprint feature and the pronunciation fundamental frequency of each single word has better naturalness and clarity and is less prone to errors.
  • FIG. 1 is a schematic flow chart of a preferred embodiment of a speech synthesis method according to the present application.
  • FIG. 2 is a schematic flowchart of a training process of a preset type recognition model in a preferred embodiment of the speech synthesis method of the present application;
  • FIG. 3 is a schematic diagram of an operating environment of a preferred embodiment of a speech synthesis system of the present application
  • FIG. 4 is a block diagram of a program of a preferred embodiment of a speech synthesis system of the present application.
  • FIG. 1 is a schematic flowchart of a voice synthesizing method according to a preferred embodiment of the present application.
  • the voice synthesis method includes:
  • Step S10: after receiving the text to be synthesized for speech synthesis, split the sentences and phrases in the text to be synthesized into single words, determine the pronunciation duration and pronunciation fundamental frequency corresponding to each single word according to a predetermined mapping relationship between single words, pronunciation durations, and pronunciation fundamental frequencies, divide each single word into preset-type speech features according to a predetermined pronunciation dictionary, and determine the speech features of each single word corresponding to the text to be synthesized;
  • The pronunciation fundamental frequency is sometimes also called pitch and refers to the fundamental frequency of the pronunciation.
  • A sound is essentially composed of many sine waves of different frequencies, and the sine wave with the lowest frequency is the fundamental frequency.
  • A phoneme is the smallest phonetic unit defined by the natural attributes of speech. From the perspective of acoustic properties, a phoneme is the smallest unit of speech divided by sound quality; from the physiological perspective, one pronunciation action forms one phoneme. For example, [ma] contains two pronunciation actions, [m] and [a], which are two phonemes.
  • Sounds produced by the same pronunciation action are the same phoneme, and sounds produced by different pronunciation actions are different phonemes. For example, in [ma-mi], the two [m] sounds are pronounced the same and are the same phoneme, while [a] and [i] are pronounced differently and are different phonemes.
  • "Mandarin" (putonghua) consists of three syllables, "pu, tong, hua", which can be decomposed into eight phonemes: "p, u, t, o, ng, h, u, a".
  • The pronunciation fundamental frequency and pronunciation duration (i.e., the sound length) of a single word may be determined by a pre-trained model, for example a pre-trained Hidden Markov Model (HMM); the preset-type speech features may include, for example, syllables, phonemes, initials, and finals.
  • After receiving the text to be synthesized for speech synthesis, the speech synthesis system splits the sentences and phrases in the text to be synthesized into a plurality of single words. The system holds a predetermined pronunciation dictionary (for example, a Mandarin pronunciation dictionary or a Cantonese pronunciation dictionary) and a predetermined mapping table between single words, pronunciation durations, and pronunciation fundamental frequencies. After splitting the sentences and phrases into single words, the system looks up the mapping table to find the pronunciation duration and pronunciation fundamental frequency corresponding to each single word, and further divides each single word into preset-type speech features according to the predetermined pronunciation dictionary, thereby obtaining the speech features of each single word corresponding to the text to be synthesized (a minimal sketch of this step is given below).
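  • The following Python sketch illustrates this lookup step under simple assumptions: the mapping table and pronunciation dictionary below are hypothetical toy data (the patent does not specify their formats), and Chinese text is split character by character.

```python
# Minimal sketch of Step S10, using hypothetical toy data.

# Hypothetical per-word lookup: word -> (pronunciation duration in seconds, fundamental frequency in Hz)
DURATION_F0_TABLE = {
    "普": (0.22, 210.0),
    "通": (0.20, 195.0),
    "话": (0.25, 180.0),
}

# Hypothetical pronunciation dictionary: word -> preset-type speech features
# (here: syllable, initial, final, phonemes)
PRONUNCIATION_DICT = {
    "普": {"syllable": "pu",   "initial": "p", "final": "u",   "phonemes": ["p", "u"]},
    "通": {"syllable": "tong", "initial": "t", "final": "ong", "phonemes": ["t", "o", "ng"]},
    "话": {"syllable": "hua",  "initial": "h", "final": "ua",  "phonemes": ["h", "u", "a"]},
}

def analyze_text(text: str):
    """Split the text into single words (characters) and attach pronunciation
    duration, fundamental frequency and preset-type speech features to each one."""
    words = list(text)  # for Chinese, each character is treated as a single word
    analysis = []
    for w in words:
        duration, f0 = DURATION_F0_TABLE.get(w, (0.2, 200.0))  # fallback defaults
        features = PRONUNCIATION_DICT.get(w, {"syllable": w, "phonemes": [w]})
        analysis.append({"word": w, "duration": duration, "f0": f0, "features": features})
    return analysis

if __name__ == "__main__":
    for item in analyze_text("普通话"):
        print(item)
```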
  • Step S20: extract the preset-type acoustic feature vector corresponding to the text to be synthesized according to the speech features and pronunciation durations of the single words corresponding to the text to be synthesized;
  • The preset-type acoustic feature vector is an acoustic and linguistic feature vector; it includes the acoustic and linguistic features listed in Table 1, such as phoneme type, duration, pitch, accent position, lip shape, and finals.
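  • Table 1 is not reproduced in this text, so the exact feature set is unknown; the sketch below assumes a small illustrative subset (a one-hot phoneme-type encoding plus duration, pitch, and accent position) purely to show how per-word features could be assembled into an acoustic feature vector.

```python
import numpy as np

# Hypothetical phoneme inventory; the full feature set of Table 1 is not
# reproduced here, so only a small illustrative subset is encoded.
PHONEMES = ["p", "t", "h", "u", "o", "a", "ng"]

def word_feature_vector(phonemes, duration, f0, accent=0.0):
    """Concatenate a one-hot phoneme-type encoding with numeric acoustic and
    linguistic features (duration, pitch, accent position) for one single word."""
    one_hot = np.zeros(len(PHONEMES))
    for p in phonemes:
        if p in PHONEMES:
            one_hot[PHONEMES.index(p)] = 1.0
    return np.concatenate([one_hot, [duration, f0, accent]])

def text_feature_matrix(word_analyses):
    """Stack the per-word vectors into the preset-type acoustic feature
    representation of the whole text (one row per single word)."""
    return np.vstack([
        word_feature_vector(a["phonemes"], a["duration"], a["f0"])
        for a in word_analyses
    ])

if __name__ == "__main__":
    words = [
        {"phonemes": ["p", "u"], "duration": 0.22, "f0": 210.0},
        {"phonemes": ["t", "o", "ng"], "duration": 0.20, "f0": 195.0},
        {"phonemes": ["h", "u", "a"], "duration": 0.25, "f0": 180.0},
    ]
    print(text_feature_matrix(words).shape)  # (3, 10)
```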
  • Step S30: input the preset-type acoustic feature vector corresponding to the text to be synthesized into the trained preset-type recognition model, and identify the voiceprint feature corresponding to the text to be synthesized;
  • The speech synthesis system pre-trains the preset-type recognition model.
  • The input and output feature types of the preset-type recognition model during training can likewise be referred to Table 1. The speech synthesis system extracts the preset-type acoustic feature vector corresponding to the text to be synthesized and inputs it into the trained preset-type recognition model, and the recognition model identifies the voiceprint feature corresponding to the text to be synthesized.
  • Step S40: generate the speech corresponding to the text to be synthesized according to the voiceprint feature corresponding to the text to be synthesized and the pronunciation fundamental frequency of each single word.
  • After obtaining the voiceprint feature corresponding to the text to be synthesized, the speech synthesis system can generate the speech corresponding to the text to be synthesized according to the obtained voiceprint feature and the pronunciation fundamental frequency of each single word, thus completing the speech synthesis of the text to be synthesized.
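  • The patent does not name a particular vocoder or waveform-generation method for Step S40. Purely as an assumption for illustration, the sketch below inverts an MFCC voiceprint matrix to a rough waveform with librosa and then nudges its pitch toward a target fundamental frequency; this is not the claimed method, only one plausible way to realize the step.

```python
import numpy as np
import librosa

def synthesize_waveform(mfcc: np.ndarray, target_f0: float, sr: int = 16000) -> np.ndarray:
    """Illustrative sketch only: invert an MFCC voiceprint matrix
    (n_mfcc x frames) to a rough waveform, then shift its pitch toward
    the desired fundamental frequency."""
    audio = librosa.feature.inverse.mfcc_to_audio(mfcc, sr=sr)
    # Estimate the current median pitch and shift toward target_f0 (in semitones).
    f0 = librosa.yin(audio, fmin=60, fmax=400, sr=sr)
    current = float(np.median(f0))
    if current > 0:
        n_steps = 12.0 * np.log2(target_f0 / current)
        audio = librosa.effects.pitch_shift(audio, sr=sr, n_steps=n_steps)
    return audio
```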
  • In this embodiment, the phrases and sentences in the text to be synthesized are first split into single words, and the pronunciation fundamental frequency, pronunciation duration, and speech features corresponding to each single word are determined; then, according to the speech features and pronunciation durations of the single words corresponding to the text to be synthesized, the preset-type acoustic feature vector corresponding to the text to be synthesized is extracted, and the extracted preset-type acoustic feature vector is recognized by the trained preset-type recognition model, thereby identifying the voiceprint feature corresponding to the text to be synthesized; finally, the speech corresponding to the text to be synthesized is generated according to the voiceprint feature corresponding to the text to be synthesized and the pronunciation fundamental frequency of each single word.
  • Because the preset-type recognition model has been trained in advance on a large amount of data, the accuracy of the identified voiceprint feature is high, and the speech generated from that voiceprint feature and the pronunciation fundamental frequencies of the single words has better naturalness and clarity and is less prone to errors.
  • In this embodiment, the preset-type recognition model is a deep feedforward network (DNN) model. The deep feedforward network is a five-layer neural network whose layers are, in order, 136L-75N-25S-75N-25L, where the numbers give the neural nodes in each layer, L denotes a linear activation function, N denotes a tanh activation function, and S denotes a sigmoid activation function.
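  • A minimal PyTorch sketch of such a network is given below. The interpretation of the notation (136 linear input units feeding 75 tanh, 25 sigmoid, 75 tanh, and finally 25 linear output units) is an assumption; the patent only lists the layer sizes and activation codes.

```python
import torch.nn as nn

# Sketch of the five-layer deep feedforward network 136L-75N-25S-75N-25L,
# where L = linear, N = tanh, S = sigmoid activation.
model = nn.Sequential(
    nn.Linear(136, 75), nn.Tanh(),     # 136 linear inputs -> 75 tanh units
    nn.Linear(75, 25),  nn.Sigmoid(),  # 25 sigmoid units
    nn.Linear(25, 75),  nn.Tanh(),     # 75 tanh units
    nn.Linear(75, 25),                 # 25 linear output units (voiceprint feature)
)
```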
  • the training process of the preset type recognition model is as follows:
  • Step E1: acquiring a preset number of training texts and corresponding training voices;
  • the preset number is 100,000, that is, 100,000 training texts and training speech corresponding to the 100,000 training texts are obtained.
  • The training text includes, but is not limited to, single words, phrases, and sentences of Mandarin Chinese; for example, the training text may further include English letters, phrases, sentences, and the like.
  • Step E2: the sentences and phrases in each training text are split into single words, each single word is split into preset-type speech features according to a predetermined pronunciation dictionary, and the speech features of each single word corresponding to each training text are determined;
  • The speech synthesis system first splits the sentences and phrases in each training text into single words, and then splits each single word into preset-type speech features through the predetermined pronunciation dictionary in the system, thereby determining the speech features of each single word corresponding to each training text.
  • Step E3: according to the predetermined mapping relationship between single words and pronunciation durations, determine the pronunciation duration corresponding to each single word, and extract the preset-type acoustic feature vector corresponding to each training text according to the speech features and pronunciation durations of the single words corresponding to that training text;
  • The speech synthesis system holds a mapping table between single words and pronunciation durations; from this table the pronunciation duration of each single word corresponding to each training text can be looked up. After determining these durations, the speech synthesis system extracts the preset-type acoustic feature vector corresponding to each training text according to the speech features and pronunciation durations of the single words corresponding to that training text.
  • the preset type acoustic feature vector is an acoustic and linguistic feature vector, and the preset type acoustic feature vector specifically includes the acoustic and linguistic feature vectors in Table 1 above.
  • Step E4: process each training voice with a preset filter to extract a preset-type voiceprint feature of each training voice, and, according to the mapping relationship between training texts and training voices, associate the acoustic feature vector of each training text with the voiceprint feature of the corresponding training voice to obtain associated data of acoustic feature vectors and voiceprint features;
  • the preset filter is, for example, a Mel filter.
  • The speech synthesis system processes the training voice corresponding to each training text with the preset filter to extract a preset-type voiceprint feature of each training voice; then, according to the mapping relationship between training texts and training voices, the acoustic feature vector of each training text is associated with the voiceprint feature of the corresponding training voice to obtain the associated data of acoustic feature vectors and voiceprint features.
  • The preset-type voiceprint feature may be the Mel-frequency cepstral coefficients (MFCC), and all the coefficients of a training voice correspond to one feature matrix.
  • Step E5: the associated data is divided into a training set of a first percentage and a verification set of a second percentage, where the sum of the first percentage and the second percentage is less than or equal to 100%;
  • the training set and the verification set respectively occupy a first percentage and a second percentage of the associated data
  • The sum of the first percentage and the second percentage is less than or equal to 100%; that is, all of the associated data may be divided into the training set and the verification set, or only part of the associated data may be so divided. For example, the first percentage is 65% and the second percentage is 30%.
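  • A simple sketch of this split, assuming the associated data is a list of (acoustic feature vector, voiceprint feature) pairs, is shown below; the 65%/30% figures follow the example above.

```python
import random

def split_associated_data(pairs, train_pct=0.65, val_pct=0.30, seed=0):
    """Split (acoustic_feature_vector, voiceprint_feature) pairs into a training
    set and a verification set; train_pct + val_pct may be <= 100%."""
    assert train_pct + val_pct <= 1.0
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_train = int(len(pairs) * train_pct)
    n_val = int(len(pairs) * val_pct)
    return pairs[:n_train], pairs[n_train:n_train + n_val]
```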
  • Step E6: train the preset-type recognition model with the associated data of acoustic feature vectors and voiceprint features in the training set, and, after the training is completed, verify the accuracy of the trained preset-type recognition model with the verification set;
  • The system trains the preset-type recognition model with the associated data of acoustic feature vectors and voiceprint features in the training set; after the training of the preset-type recognition model is completed, its accuracy is verified with the verification set.
  • Step E7: if the accuracy rate is greater than the preset threshold, the model training ends;
  • If the accuracy rate obtained when the verification set verifies the preset-type recognition model exceeds the preset threshold (for example, 98.5%), the training effect of the preset-type recognition model has reached the expected standard; the model training ends, and the speech synthesis system can apply the trained preset-type recognition model.
  • Step E8: if the accuracy rate is less than or equal to the preset threshold, increase the number of training texts and corresponding training voices, and re-execute the foregoing steps E2, E3, E4, E5, and E6 based on the increased training texts and corresponding training voices.
  • If the obtained accuracy rate is less than or equal to the preset threshold, the training effect of the preset-type recognition model has not reached the expected standard, possibly because the training set or the verification set is not large enough. In this case, the number of training texts and corresponding training voices is increased (for example, by a fixed amount each time or by a random amount each time), and steps E2, E3, E4, E5, and E6 are re-executed on this basis; this loop is repeated until the requirement of step E7 is met and the model training ends.
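  • The loop described in steps E6 to E8 might look like the sketch below, which trains the PyTorch model from the earlier sketch on (x, y) tensor pairs. How the patent measures "accuracy" for continuous voiceprint targets is not specified, so the tolerance-based check and the make_dataset helper are assumptions introduced only for illustration.

```python
import torch
import torch.nn as nn

def train_and_verify(model, train_set, val_set, epochs=20, lr=1e-3, tol=0.1):
    """Sketch of steps E6-E7: train on the training set, then compute a simple
    'accuracy' on the verification set (fraction of samples whose mean output
    error is below an assumed tolerance)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for x, y in train_set:  # x: acoustic feature vector, y: voiceprint feature (tensors)
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
    correct = sum(
        1 for x, y in val_set if (model(x) - y).abs().mean().item() < tol
    )
    return correct / max(len(val_set), 1)

def train_until_accurate(make_dataset, model, threshold=0.985, batch_texts=100_000):
    """Sketch of step E8: keep enlarging the training corpus and retraining until
    the verification accuracy exceeds the preset threshold (e.g. 98.5%).
    make_dataset is a hypothetical helper returning (train_set, val_set)."""
    n_texts = batch_texts
    while True:
        train_set, val_set = make_dataset(n_texts)
        accuracy = train_and_verify(model, train_set, val_set)
        if accuracy > threshold:
            return model
        n_texts += batch_texts  # acquire more training texts and voices
```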
  • In this embodiment, the preset filter is a Mel filter; in step E4, processing each training voice with the preset filter to extract the preset-type voiceprint feature of each training voice includes the following steps:
  • Each training voice is pre-emphasized, framed, and windowed; pre-emphasis compensates for the high-frequency components of the training voice.
  • Each window of each training voice is subjected to a Fourier transform (i.e., FFT) to obtain the corresponding spectrum.
  • The spectrum obtained by the Fourier transform is then passed through the Mel filter to obtain the Mel spectrum.
  • Cepstral analysis is performed on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), which constitute the voiceprint feature of that frame of speech.
  • The cepstral analysis in this embodiment includes taking the logarithm and applying an inverse transform; in practice, the inverse transform is generally implemented by a discrete cosine transform (DCT), and the second to thirteenth coefficients after the DCT are taken as the MFCC coefficients.
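  • The whole extraction chain (pre-emphasis, framing and windowing, FFT, Mel filtering, log, DCT, keeping coefficients 2-13) can be sketched as follows; the frame length, hop size, FFT size, and number of Mel filters are common defaults assumed here, not values given in the patent.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc_voiceprint(signal, sr=16000, frame_len=400, hop=160, n_mels=26, n_fft=512):
    """Sketch of the MFCC extraction in step E4: pre-emphasis, framing, windowing,
    FFT, Mel filterbank, log, DCT, keeping coefficients 2-13."""
    # 1. Pre-emphasis to boost the high-frequency components.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2. Framing and Hamming windowing.
    frames = []
    for start in range(0, len(emphasized) - frame_len + 1, hop):
        frames.append(emphasized[start:start + frame_len] * np.hamming(frame_len))
    frames = np.array(frames)
    # 3. FFT of each window -> power spectrum.
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft
    # 4. Mel filterbank -> Mel spectrum.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel_spec = power @ mel_fb.T
    # 5. Cepstral analysis: log, then DCT; keep the 2nd-13th coefficients as MFCCs.
    log_mel = np.log(mel_spec + 1e-10)
    cepstra = dct(log_mel, type=2, axis=1, norm="ortho")
    return cepstra[:, 1:13]  # one 12-dimensional voiceprint vector per frame
```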
  • the application also proposes a speech synthesis system.
  • FIG. 3 is a schematic diagram of an operating environment of a preferred embodiment of the speech synthesis system 10 of the present application.
  • the speech synthesis system 10 is installed and operated in the electronic device 1.
  • the electronic device 1 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a server.
  • the electronic device 1 may include, but is not limited to, a memory 11, a processor 12, and a display 13.
  • Figure 3 shows only the electronic device 1 with components 11-13, but it should be understood that not all of the illustrated components need be implemented, and alternative implementations may include more or fewer components.
  • the memory 11 is a computer storage medium, which in some embodiments may be an internal storage unit of the electronic device 1, such as a hard disk or memory of the electronic device 1.
  • In other embodiments, the memory 11 may also be an external storage device of the electronic device 1, such as a plug-in hard disk, a smart memory card (SMC), a secure digital (SD) card, or a flash card equipped on the electronic device 1. Further, the memory 11 may include both an internal storage unit of the electronic device 1 and an external storage device.
  • the memory 11 is used to store application software installed in the electronic device 1 and various types of data, such as program codes of the speech synthesis system 10.
  • the memory 11 can also be used to temporarily store data that has been output or is about to be output.
  • In some embodiments, the processor 12 may be a central processing unit (CPU), a microprocessor, or another data processing chip for running program code stored in the memory 11 or processing data, for example executing the speech synthesis system 10.
  • the display 13 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch sensor, or the like in some embodiments.
  • The display 13 is used for displaying information processed in the electronic device 1 and for displaying a visualized user interface, such as a business customization interface.
  • the components 11-13 of the electronic device 1 communicate with one another via a system bus.
  • FIG. 4 is a program block diagram of a preferred embodiment of the speech synthesis system 10 of the present application.
  • The speech synthesis system 10 may be divided into one or more modules, the one or more modules being stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to complete the present application.
  • For example, the speech synthesis system 10 may be divided into a determining module 101, an extraction module 102, an identification module 103, and a generating module 104.
  • A module referred to in the present application is a series of computer program instruction segments capable of performing a specific function and is more suitable than a program for describing the execution process of the speech synthesis system 10 in the electronic device 1, wherein:
  • the determining module 101 is configured to split the sentence and the phrase in the text to be synthesized into a single word after receiving the text to be synthesized for the speech synthesis, according to the predetermined single word, the pronunciation duration, and the pronunciation fundamental frequency. a mapping relationship, determining a pronunciation duration and a pronunciation fundamental frequency corresponding to each single word, and dividing each single word into a preset type of speech feature according to a predetermined pronunciation dictionary, and determining a speech feature of each single word corresponding to the to-be-synthesized text;
  • The pronunciation fundamental frequency and pronunciation duration (i.e., the sound length) of a single word may be determined by a pre-trained model, for example a pre-trained Hidden Markov Model (HMM); the preset-type speech features may include, for example, syllables, phonemes, initials, and finals.
  • After receiving the text to be synthesized for speech synthesis, the speech synthesis system splits the sentences and phrases in the text to be synthesized into a plurality of single words. The system holds a predetermined pronunciation dictionary (for example, a Mandarin pronunciation dictionary or a Cantonese pronunciation dictionary) and a predetermined mapping table between single words, pronunciation durations, and pronunciation fundamental frequencies. After splitting the sentences and phrases into single words, the system looks up the mapping table to find the pronunciation duration and pronunciation fundamental frequency corresponding to each single word, and further divides each single word into preset-type speech features according to the predetermined pronunciation dictionary, thereby obtaining the speech features of each single word corresponding to the text to be synthesized.
  • the extraction module 102 is configured to extract a preset type acoustic feature vector corresponding to the text to be synthesized according to the voice features and the pronunciation duration of each single word corresponding to the text to be synthesized;
  • The preset-type acoustic feature vector is an acoustic and linguistic feature vector; it includes the acoustic and linguistic features listed in Table 2, such as phoneme type, duration, pitch, accent position, lip shape, and finals.
  • the identification module 103 is configured to input the preset type acoustic feature vector corresponding to the text to be synthesized into the trained preset type recognition model, and identify the voiceprint feature corresponding to the text to be synthesized;
  • the speech synthesis system pre-trains the preset type recognition model.
  • The input and output feature types of the preset-type recognition model during training can likewise be referred to Table 1. The speech synthesis system extracts the preset-type acoustic feature vector corresponding to the text to be synthesized and inputs it into the trained preset-type recognition model, and the recognition model identifies the voiceprint feature corresponding to the text to be synthesized.
  • The generating module 104 is configured to generate the speech corresponding to the text to be synthesized according to the voiceprint feature corresponding to the text to be synthesized and the pronunciation fundamental frequency of each single word.
  • After obtaining the voiceprint feature corresponding to the text to be synthesized, the speech synthesis system can generate the speech corresponding to the text to be synthesized according to the obtained voiceprint feature and the pronunciation fundamental frequency of each single word, thus completing the speech synthesis of the text to be synthesized.
  • In this embodiment, the phrases and sentences in the text to be synthesized are first split into single words, and the pronunciation fundamental frequency, pronunciation duration, and speech features corresponding to each single word are determined; then, according to the speech features and pronunciation durations of the single words corresponding to the text to be synthesized, the preset-type acoustic feature vector corresponding to the text to be synthesized is extracted, and the extracted preset-type acoustic feature vector is recognized by the trained preset-type recognition model, thereby identifying the voiceprint feature corresponding to the text to be synthesized; finally, the speech corresponding to the text to be synthesized is generated according to the voiceprint feature corresponding to the text to be synthesized and the pronunciation fundamental frequency of each single word.
  • Because the preset-type recognition model has been trained in advance on a large amount of data, the accuracy of the identified voiceprint feature is high, and the speech generated from that voiceprint feature and the pronunciation fundamental frequencies of the single words has better naturalness and clarity and is less prone to errors.
  • In this embodiment, the preset-type recognition model is a deep feedforward network (DNN) model. The deep feedforward network is a five-layer neural network whose layers are, in order, 136L-75N-25S-75N-25L, where the numbers give the neural nodes in each layer, L denotes a linear activation function, N denotes a tanh activation function, and S denotes a sigmoid activation function.
  • the training process of the preset type recognition model in this embodiment is as follows:
  • Step E1: acquiring a preset number of training texts and corresponding training voices;
  • the preset number is 100,000, that is, 100,000 training texts and training speech corresponding to the 100,000 training texts are obtained.
  • The training text includes, but is not limited to, single words, phrases, and sentences of Mandarin Chinese; for example, the training text may further include English letters, phrases, sentences, and the like.
  • Step E2: the sentences and phrases in each training text are split into single words, each single word is split into preset-type speech features according to a predetermined pronunciation dictionary, and the speech features of each single word corresponding to each training text are determined;
  • The speech synthesis system first splits the sentences and phrases in each training text into single words, and then splits each single word into preset-type speech features through the predetermined pronunciation dictionary in the system, thereby determining the speech features of each single word corresponding to each training text.
  • Step E3: according to the predetermined mapping relationship between single words and pronunciation durations, determine the pronunciation duration corresponding to each single word, and extract the preset-type acoustic feature vector corresponding to each training text according to the speech features and pronunciation durations of the single words corresponding to that training text;
  • The speech synthesis system holds a mapping table between single words and pronunciation durations; from this table the pronunciation duration of each single word corresponding to each training text can be looked up. After determining these durations, the speech synthesis system extracts the preset-type acoustic feature vector corresponding to each training text according to the speech features and pronunciation durations of the single words corresponding to that training text.
  • the preset type acoustic feature vector is an acoustic and linguistic feature vector, and the preset type acoustic feature vector specifically includes the acoustic and linguistic feature vectors in Table 2 above.
  • Step E4: process each training voice with a preset filter to extract a preset-type voiceprint feature of each training voice, and, according to the mapping relationship between training texts and training voices, associate the acoustic feature vector of each training text with the voiceprint feature of the corresponding training voice to obtain associated data of acoustic feature vectors and voiceprint features;
  • the preset filter is, for example, a Mel filter.
  • The speech synthesis system processes the training voice corresponding to each training text with the preset filter to extract a preset-type voiceprint feature of each training voice; then, according to the mapping relationship between training texts and training voices, the acoustic feature vector of each training text is associated with the voiceprint feature of the corresponding training voice to obtain the associated data of acoustic feature vectors and voiceprint features.
  • The preset-type voiceprint feature may be the Mel-frequency cepstral coefficients (MFCC), and all the coefficients of a training voice correspond to one feature matrix.
  • Step E5: the associated data is divided into a training set of a first percentage and a verification set of a second percentage, where the sum of the first percentage and the second percentage is less than or equal to 100%;
  • The training set and the verification set respectively occupy a first percentage and a second percentage of the associated data.
  • The sum of the first percentage and the second percentage is less than or equal to 100%; that is, all of the associated data may be divided into the training set and the verification set, or only part of the associated data may be so divided. For example, the first percentage is 65% and the second percentage is 30%.
  • Step E6: train the preset-type recognition model with the associated data of acoustic feature vectors and voiceprint features in the training set, and, after the training is completed, verify the accuracy of the trained preset-type recognition model with the verification set;
  • The system trains the preset-type recognition model with the associated data of acoustic feature vectors and voiceprint features in the training set; after the training of the preset-type recognition model is completed, its accuracy is verified with the verification set.
  • Step E7: if the accuracy rate is greater than the preset threshold, the model training ends;
  • If the accuracy rate obtained when the verification set verifies the preset-type recognition model exceeds the preset threshold (for example, 98.5%), the training effect of the preset-type recognition model has reached the expected standard; the model training ends, and the speech synthesis system can apply the trained preset-type recognition model.
  • Step E8: if the accuracy rate is less than or equal to the preset threshold, increase the number of training texts and corresponding training voices, and re-execute the foregoing steps E2, E3, E4, E5, and E6 based on the increased training texts and corresponding training voices.
  • If the obtained accuracy rate is less than or equal to the preset threshold, the training effect of the preset-type recognition model has not reached the expected standard, possibly because the training set or the verification set is not large enough. In this case, the number of training texts and corresponding training voices is increased (for example, by a fixed amount each time or by a random amount each time), and steps E2, E3, E4, E5, and E6 are re-executed on this basis; this loop is repeated until the requirement of step E7 is met and the model training ends.
  • In this embodiment, the preset filter is a Mel filter; in step E4, processing each training voice with the preset filter to extract the preset-type voiceprint feature of each training voice includes the following steps:
  • Each training voice is pre-emphasized, framed, and windowed; pre-emphasis compensates for the high-frequency components of the training voice.
  • Each window of each training voice is subjected to a Fourier transform (i.e., FFT) to obtain the corresponding spectrum.
  • The spectrum obtained by the Fourier transform is then passed through the Mel filter to obtain the Mel spectrum.
  • Cepstral analysis is performed on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), which constitute the voiceprint feature of that frame of speech.
  • The cepstral analysis in this embodiment includes taking the logarithm and applying an inverse transform; in practice, the inverse transform is generally implemented by a discrete cosine transform (DCT), and the second to thirteenth coefficients after the DCT are taken as the MFCC coefficients.
  • The present application also provides a computer-readable storage medium storing a speech synthesis system, the speech synthesis system being executable by at least one processor to cause the at least one processor to perform the speech synthesis method in any of the above embodiments.

Abstract

The present invention relates to an electronic apparatus, a speech synthesis method, and a storage medium. The method comprises the following steps: upon receiving a text to be synthesized, splitting the sentences and phrases of the text to be synthesized into single words, determining, according to a predetermined mapping relationship between single words, pronunciation durations, and pronunciation fundamental frequencies, a pronunciation duration and a pronunciation fundamental frequency corresponding to each of the words, and categorizing, according to a predetermined pronunciation dictionary, the respective words into preset types of speech features; extracting, according to the speech feature and pronunciation duration of each word, a preset type of acoustic feature vector corresponding to the text to be synthesized; inputting the preset type of acoustic feature vector corresponding to the text to be synthesized into a trained preset-type recognition model, and identifying a voiceprint feature of the text to be synthesized; and generating, according to the identified voiceprint feature and the pronunciation fundamental frequencies of the words, the speech corresponding to the text to be synthesized. The technical solution of the present invention yields speech synthesis results that are highly accurate, natural, and clear.
PCT/CN2017/108766 2017-09-25 2017-10-31 Electronic apparatus, speech synthesis method, and computer-readable storage medium WO2019056500A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710874876.2A CN107564511B (zh) 2017-09-25 2017-09-25 电子装置、语音合成方法和计算机可读存储介质
CN201710874876.2 2017-09-25

Publications (1)

Publication Number Publication Date
WO2019056500A1 true WO2019056500A1 (fr) 2019-03-28

Family

ID=60982768

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/108766 WO2019056500A1 (fr) 2017-09-25 2017-10-31 Electronic apparatus, speech synthesis method, and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN107564511B (fr)
WO (1) WO2019056500A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210390943A1 (en) * 2020-06-15 2021-12-16 Beijing Baidu Netcom Science And Technology Co., Ltd. Method And Apparatus For Training Model, Method And Apparatus For Synthesizing Speech, Device And Storage Medium

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108630190B (zh) * 2018-05-18 2019-12-10 百度在线网络技术(北京)有限公司 用于生成语音合成模型的方法和装置
CN109346056B (zh) * 2018-09-20 2021-06-11 中国科学院自动化研究所 基于深度度量网络的语音合成方法及装置
CN109584859A (zh) * 2018-11-07 2019-04-05 上海指旺信息科技有限公司 语音合成方法及装置
CN109754778B (zh) 2019-01-17 2023-05-30 平安科技(深圳)有限公司 文本的语音合成方法、装置和计算机设备
CN110164413B (zh) * 2019-05-13 2021-06-04 北京百度网讯科技有限公司 语音合成方法、装置、计算机设备和存储介质
CN110767210A (zh) * 2019-10-30 2020-02-07 四川长虹电器股份有限公司 一种生成个性化语音的方法及装置
CN111161705B (zh) * 2019-12-19 2022-11-18 寒武纪(西安)集成电路有限公司 语音转换方法及装置
CN111091807B (zh) * 2019-12-26 2023-05-26 广州酷狗计算机科技有限公司 语音合成方法、装置、计算机设备及存储介质
CN111508469A (zh) * 2020-04-26 2020-08-07 北京声智科技有限公司 一种文语转换方法及装置
CN111429923B (zh) * 2020-06-15 2020-09-29 深圳市友杰智新科技有限公司 说话人信息提取模型的训练方法、装置和计算机设备
CN111968616A (zh) * 2020-08-19 2020-11-20 浙江同花顺智能科技有限公司 一种语音合成模型的训练方法、装置、电子设备和存储介质
CN112184858B (zh) 2020-09-01 2021-12-07 魔珐(上海)信息科技有限公司 基于文本的虚拟对象动画生成方法及装置、存储介质、终端
CN112184859B (zh) * 2020-09-01 2023-10-03 魔珐(上海)信息科技有限公司 端到端的虚拟对象动画生成方法及装置、存储介质、终端
CN113838450B (zh) * 2021-08-11 2022-11-25 北京百度网讯科技有限公司 音频合成及相应的模型训练方法、装置、设备及存储介质

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050055207A1 (en) * 2000-03-31 2005-03-10 Canon Kabushiki Kaisha Speech information processing method and apparatus and storage medium using a segment pitch pattern model
CN101710488A (zh) * 2009-11-20 2010-05-19 安徽科大讯飞信息科技股份有限公司 语音合成方法及装置
CN101894547A (zh) * 2010-06-30 2010-11-24 北京捷通华声语音技术有限公司 一种语音合成方法和系统

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101000765B (zh) * 2007-01-09 2011-03-30 黑龙江大学 基于韵律特征的语音合成方法
JP5025550B2 (ja) * 2008-04-01 2012-09-12 株式会社東芝 音声処理装置、音声処理方法及びプログラム
CN104538024B (zh) * 2014-12-01 2019-03-08 百度在线网络技术(北京)有限公司 语音合成方法、装置及设备

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050055207A1 (en) * 2000-03-31 2005-03-10 Canon Kabushiki Kaisha Speech information processing method and apparatus and storage medium using a segment pitch pattern model
CN101710488A (zh) * 2009-11-20 2010-05-19 安徽科大讯飞信息科技股份有限公司 语音合成方法及装置
CN101894547A (zh) * 2010-06-30 2010-11-24 北京捷通华声语音技术有限公司 一种语音合成方法和系统

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210390943A1 (en) * 2020-06-15 2021-12-16 Beijing Baidu Netcom Science And Technology Co., Ltd. Method And Apparatus For Training Model, Method And Apparatus For Synthesizing Speech, Device And Storage Medium
US11769480B2 (en) * 2020-06-15 2023-09-26 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for training model, method and apparatus for synthesizing speech, device and storage medium

Also Published As

Publication number Publication date
CN107564511B (zh) 2018-09-11
CN107564511A (zh) 2018-01-09

Similar Documents

Publication Publication Date Title
WO2019056500A1 (fr) Electronic apparatus, speech synthesis method, and computer-readable storage medium
Ma et al. Short utterance based speech language identification in intelligent vehicles with time-scale modifications and deep bottleneck features
US9431011B2 (en) System and method for pronunciation modeling
US8321218B2 (en) Searching in audio speech
CN109686383B (zh) 一种语音分析方法、装置及存储介质
US20160049144A1 (en) System and method for unified normalization in text-to-speech and automatic speech recognition
Stan et al. TUNDRA: a multilingual corpus of found data for TTS research created with light supervision
JP2001282282A (ja) 音声情報処理方法および装置および記憶媒体
Kirchner et al. Computing phonological generalization over real speech exemplars
Black et al. Automated evaluation of non-native English pronunciation quality: combining knowledge-and data-driven features at multiple time scales
Qian et al. Capturing L2 segmental mispronunciations with joint-sequence models in computer-aided pronunciation training (CAPT)
Bhatt et al. Continuous speech recognition technologies—a review
Miodonska et al. Dynamic time warping in phoneme modeling for fast pronunciation error detection
Meng et al. Synthesizing English emphatic speech for multimodal corrective feedback in computer-aided pronunciation training
Oura et al. Unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis
Toyama et al. Use of Global and Acoustic Features Associated with Contextual Factors to Adapt Language Models for Spontaneous Speech Recognition.
Ekpenyong et al. Improved syllable-based text to speech synthesis for tone language systems
Giwa et al. The effect of language identification accuracy on speech recognition accuracy of proper names
Park et al. Jejueo datasets for machine translation and speech synthesis
CN112329484A (zh) 自然语言的翻译方法及装置
AbuZeina et al. Arabic speech recognition systems
Heo et al. Classification based on speech rhythm via a temporal alignment of spoken sentences
US20230215421A1 (en) End-to-end neural text-to-speech model with prosody control
Boháč et al. Automatic syllabification and syllable timing of automatically recognized speech–for czech
Gonzalvo et al. Text-to-speech with cross-lingual neural network-based grapheme-to-phoneme models.

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17926126

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 24.09.2020)

122 Ep: pct application non-entry in european phase

Ref document number: 17926126

Country of ref document: EP

Kind code of ref document: A1