CN109036377A - Speech synthesis method and device - Google Patents

Speech synthesis method and device

Info

Publication number
CN109036377A
CN109036377A
Authority
CN
China
Prior art keywords
speech
phoneme
feature vector
voice
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810834892.3A
Other languages
Chinese (zh)
Inventor
何树民
徐文韬
陈玉玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Unionpay Co Ltd
Original Assignee
China Unionpay Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Unionpay Co Ltd filed Critical China Unionpay Co Ltd
Priority to CN201810834892.3A priority Critical patent/CN109036377A/en
Publication of CN109036377A publication Critical patent/CN109036377A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/10: Prosody rules derived from text; Stress or intonation
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique
    • G10L 25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides a speech synthesis method and device, relating to the field of computer technology. The method comprises: determining a phoneme sequence corresponding to the text information to be uttered, where the phoneme sequence comprises a plurality of phoneme information items whose ordering is consistent with the ordering of the characters in the text information, and each phoneme information item comprises the initial, final, and tone of its corresponding character; and inputting the phoneme sequence into a speech utterance model to determine the speech feature vector corresponding to the text information, where the speech utterance model is obtained by neural network training on utterance samples, and the speech feature vector is used for playback through a playing device. Because the relationship among the initials, finals, and tones of Chinese pronunciation is taken into account, the simulated sound is more authentic, and the method is applicable to the various dialects and other languages composed of phonemes, giving it very high extensibility.

Description

Speech synthesis method and device
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for speech synthesis.
Background
Speech synthesis is a technique for generating artificial speech by mechanical or electronic means. TTS (text-to-speech) technology is a branch of speech synthesis; it converts text information generated by a computer or input from the outside into intelligible, fluent spoken Chinese and outputs it.
Current speech synthesis techniques in the prior art include the following. The first is rule-based synthesis, which generates the target speech from phonetic rules. A rule-based synthesis system stores the acoustic parameters of small speech units (e.g., phonemes, diphones, demi-syllables, or syllables), together with rules for composing syllables from phonemes and words or sentences from syllables. The second is waveform-concatenation synthesis, which uses sentences, phrases, words, or syllables as synthesis units. These units are recorded, digitally coded, and compressed to form a synthetic speech library; at playback time, the waveform data of the corresponding units are retrieved from the corpus according to the information to be output, concatenated or edited together, and decoded to restore the speech. The third is synthesis based on parameter analysis, which mostly uses syllables, demi-syllables, or phones as synthesis units.
However, the first method requires very complicated rules, with different rules set for different environments and contexts, and the synthesized speech has low naturalness and cannot be widely applied. The second and third methods can synthesize high-quality speech only with a sufficient amount of high-quality speaker recordings, their speech library files are too large, and they cannot handle polyphonic characters at all; nor can they synthesize from speech collected in public settings.
In summary, the prior art cannot provide a speech synthesis method with simple rules and high speech naturalness.
Disclosure of Invention
The invention provides a speech synthesis method and a speech synthesis device to solve the problem that the prior art cannot provide a speech synthesis method with simple rules and high naturalness.
An embodiment of the invention provides a speech synthesis method, which comprises the following steps: determining a phoneme sequence corresponding to the text information to be uttered; the phoneme sequence comprises a plurality of phoneme information items, and the ordering of the phoneme information items is consistent with the ordering of the characters in the text information; each phoneme information item comprises the initial, final, and tone of its corresponding character;
inputting the phoneme sequence into a speech utterance model, and determining the speech feature vector corresponding to the text information, wherein the speech utterance model is obtained by neural network training on utterance samples; the speech feature vector is used for playback through a playing device.
In the embodiment of the invention, the text information to be uttered is converted into a sequence that follows the arrangement of the initials, finals, and tones of its characters. Because the relationship among the initials, finals, and tones of Chinese pronunciation is taken into account, the simulated sound is more authentic, and the method is applicable to the various dialects and other languages composed of phonemes, giving it very high extensibility.
Further, inputting the phoneme sequence into the speech utterance model and determining the speech feature vector corresponding to the text information includes:
inputting speech parameters and the phoneme sequence into the speech utterance model, wherein the speech parameters indicate the identity of a speaker; the speech utterance model is obtained by neural network training on the utterance samples of all speakers;
and determining, through the speech utterance model, the speech feature vector of the speaker for the text information.
In the embodiment of the invention, when speech parameters are input into the speech utterance model, the obtained speech feature vector carries the speech content in the voice of the speaker corresponding to those parameters, so that the utterance is produced in that speaker's voice. Any speaker the speech utterance model has been trained on can thus be used for playback.
Further, before the speech parameters and the phoneme sequence are input into the speech utterance model, the method further includes:
acquiring random noise, wherein the random noise follows a normal distribution;
and the inputting of the speech parameters and the phoneme sequence into the speech utterance model then includes:
inputting the speech parameters, the random noise, and the phoneme sequence into the speech utterance model.
In the embodiment of the invention, to prevent the speech feature vector output by the speech utterance model from overfitting, random noise following a normal distribution is added, making the output speech feature vector more accurate.
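As a minimal illustration of this step (our own sketch; the patent gives no code, and all shapes and identifiers here are assumptions), the noise is simply drawn from a standard normal distribution and passed alongside the other model inputs:

```python
import numpy as np

# Standard normal noise concatenated with the model's other inputs (sketch).
rng = np.random.default_rng(0)
noise = rng.standard_normal(size=(1, 16))        # noise dimension is assumed
speaker_id = np.array([[3]])                     # hypothetical speaker identifier
phoneme_ids = np.array([[12, 40, 7, 35]])        # hypothetical phoneme sequence
model_inputs = (phoneme_ids, speaker_id, noise)  # fed jointly to the network
```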
Further, the speech feature vector comprises a 60-dimensional Mel-generalized cepstral (MGC) vector, a 1-dimensional band aperiodicity (BAP) feature, a 1-dimensional log fundamental frequency (lf0) feature, and a 1-dimensional voiced/unvoiced (v/uv) feature.
In the embodiment of the invention, the selected 63-dimensional vector better models Chinese pronunciation and individual voices, so the speech feature vector better matches Chinese pronunciation rules and the generated speech better imitates the target voice.
Further, the speech utterance model is a neural network model with a working memory mechanism.
In the embodiment of the invention, because the phonemes of each character in Chinese pronunciation have no dependence on the preceding character, using a neural network model with a working memory mechanism reduces the amount of computation of the neural network and simplifies the model.
An embodiment of the present invention further provides a speech synthesis apparatus, including:
a phoneme sequence determining unit, configured to determine the phoneme sequence corresponding to the text information to be uttered; the phoneme sequence comprises a plurality of phoneme information items, and the ordering of the phoneme information items is consistent with the ordering of the characters in the text information; each phoneme information item comprises the initial, final, and tone of its corresponding character;
and a speech feature vector determining unit, configured to input the phoneme sequence into a speech utterance model and determine the speech feature vector corresponding to the text information, wherein the speech utterance model is obtained by neural network training on utterance samples; the speech feature vector is used for playback through a playing device.
In the embodiment of the invention, the text information to be uttered is converted into a sequence that follows the arrangement of the initials, finals, and tones of its characters. Because the relationship among the initials, finals, and tones of Chinese pronunciation is taken into account, the simulated sound is more authentic, and the apparatus is applicable to the various dialects and other languages composed of phonemes, giving it very high extensibility.
Further, the speech feature vector determining unit is specifically configured to:
input speech parameters and the phoneme sequence into the speech utterance model, wherein the speech parameters indicate the identity of a speaker; the speech utterance model is obtained by neural network training on the utterance samples of all speakers;
and determine, through the speech utterance model, the speech feature vector of the speaker for the text information.
Further, the phoneme sequence determining unit is further configured to:
acquire random noise, wherein the random noise follows a normal distribution;
and the speech feature vector determining unit is specifically configured to:
input the speech parameters, the random noise, and the phoneme sequence into the speech utterance model.
Further, the speech feature vector comprises a 60-dimensional Mel-generalized cepstral (MGC) vector, a 1-dimensional band aperiodicity (BAP) feature, a 1-dimensional log fundamental frequency (lf0) feature, and a 1-dimensional voiced/unvoiced (v/uv) feature.
Further, the speech utterance model is a neural network model with a working memory mechanism.
An embodiment of the present invention further provides an electronic device, including:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any of the above embodiments.
Embodiments of the present invention also provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of the above embodiments.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.
FIG. 1 is a schematic flow chart of a speech synthesis method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of training a speech utterance model according to an embodiment of the present invention;
FIG. 3 is a diagram of a human speech production model according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating a neural network training process according to an embodiment of the present invention;
FIG. 5 is a schematic flow chart of a neural network training process according to an embodiment of the present invention;
FIG. 6 is a schematic flow chart of speech synthesis according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The present invention provides a speech synthesis method, as shown in FIG. 1, including:
step 101, determining a phoneme sequence corresponding to the text information to be uttered; the phoneme sequence comprises a plurality of phoneme information items, and the ordering of the phoneme information items is consistent with the ordering of the characters in the text information; each phoneme information item comprises the initial, final, and tone of its corresponding character;
step 102, inputting the phoneme sequence into a speech utterance model, and determining the speech feature vector corresponding to the text information, wherein the speech utterance model is obtained by neural network training on utterance samples; the speech feature vector is used for playback through a playing device.
In step 101, the text information to be uttered is the text that requires speech synthesis. In the embodiment of the present invention, the initial, final, and tone of each character in the text information correspond one-to-one with the entries of the phoneme sequence; the text information is converted into a phoneme sequence according to this correspondence, and the ordering within the phoneme sequence is consistent with the arrangement of the characters.
For example, in the embodiment of the present invention, each Chinese character may be split into an initial, a final, and a tone; the initial corresponds to one identifier in the phoneme comparison table, the final together with the tone corresponds to another, and these identifiers are arranged in order to form the phoneme sequence.
For example, take "你好" ("hello") as the text information to be uttered. The pinyin of "你" is "ni3", where "n" is the initial, "i" is the final, and 3 denotes the tone; to better match Chinese pronunciation, the final and the tone are combined together as "i3". Optionally, in the embodiment of the present invention, a phoneme comparison table exists, for example as shown in Table 1:
table 1: phoneme comparison table
Through the phoneme comparison table, the initials, finals, and tones of the text information to be uttered can be converted into a phoneme sequence, whose arrangement follows the initial and the final-plus-tone of each Chinese character, in the order of the characters, as illustrated by the sketch below.
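As an illustration only (the contents of Table 1 are not reproduced above, so the table entries and helper name below are assumed stand-ins, not the patent's actual mapping), the conversion of step 101 might look like:

```python
# A minimal sketch of step 101, assuming a hypothetical phoneme comparison
# table; the real Table 1 of the patent is not reproduced here.
PHONEME_TABLE = {"n": 1, "i3": 2, "h": 3, "ao3": 4}  # assumed identifiers

def text_to_phoneme_sequence(syllables):
    """Map (initial, final+tone) pairs, in character order, to identifiers."""
    sequence = []
    for initial, final_tone in syllables:
        if initial:  # zero-initial syllables contribute no initial identifier
            sequence.append(PHONEME_TABLE[initial])
        sequence.append(PHONEME_TABLE[final_tone])
    return sequence

# "你好" -> pinyin "ni3 hao3" -> (initial, final+tone) pairs per character
print(text_to_phoneme_sequence([("n", "i3"), ("h", "ao3")]))  # [1, 2, 3, 4]
```

The ordering of the output follows the character order, matching the requirement that the phoneme sequence be consistent with the ordering of the characters.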
In step 102, after the phoneme sequence is determined, it is input into the speech utterance model and the output result is determined. In the embodiment of the invention, the speech utterance model is obtained by training a neural network on utterance samples.
Optionally, in the embodiment of the present invention, the speech utterance model is trained on utterance samples, which may be any available audio, including audio from the Internet, recorded speeches, video soundtracks, and the like.
Optionally, in the embodiment of the present invention, the playing device may be a WORLD vocoder, which pronounces according to the determined speech feature vector; the content of the pronunciation is the text information to be uttered.
Optionally, in step 102, inputting the phoneme sequence into the speech utterance model and determining the speech feature vector corresponding to the text information includes:
inputting speech parameters and the phoneme sequence into the speech utterance model, wherein the speech parameters indicate the identity of a speaker; the speech utterance model is obtained by neural network training on the utterance samples of all speakers;
and determining, through the speech utterance model, the speech feature vector of the speaker for the text information.
That is to say, in addition to the phoneme sequence, the acquired speech parameters are also input into the speech utterance model. In other words, in the embodiment of the present invention the model is trained not only on the utterance samples themselves but also on the identity information of the speaker of each sample; the resulting model can therefore not only pronounce accurately but also speak in the voice of any speaker it has been trained on.
Optionally, in the embodiment of the present invention, the input speech parameter is the identification information of a speaker whose voice is present in the training samples used to train the speech utterance model.
Optionally, in the embodiment of the present invention, the speech utterance model may be trained by the method shown in FIG. 2.
Optionally, in the embodiment of the present invention, the training samples may be any audio, including the audio tracks of videos, recorded speeches, audio programs, and similar audio content.
After the audio samples are obtained, the identity information of each speaker also needs to be obtained for model training, so that the trained speech utterance model can learn the pronunciation of each speaker.
Optionally, in the embodiment of the present invention, the training samples are divided into training data and test data, and whether the trained model achieves the expected target is judged according to the audio features of the samples.
In general, speech signal processing does not operate directly on the time-domain waveform of the sound; the audio must first be processed. The training sample is preprocessed, and the acoustic features of its audio part are extracted as a 63 × n matrix, where n is proportional to the length of the audio. The 63 dimensions mainly comprise 60-dimensional Mel-generalized cepstral (MGC) parameters: in this embodiment, the MGC features are obtained by reducing the dimensionality of Mel-frequency cepstral coefficient (MFCC) features, whose dimensionality is too high for direct training;
a 1-dimensional band aperiodicity (BAP) feature: the aperiodic excitation signal; a 1-dimensional log fundamental frequency (lf0): the fundamental frequency, a periodic pulse train characterizing an individual's voicing; and a 1-dimensional voiced/unvoiced flag (v/uv): representing unvoiced and voiced sounds.
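One plausible way to extract such 63-dimensional frame features is sketched below, assuming the pyworld and pysptk libraries; note that the patent derives MGC by reducing the dimensionality of MFCC features, whereas this sketch, as a simplification, derives MGC from the WORLD spectral envelope:

```python
# Sketch: extract a 63-dim feature per frame (60 MGC + 1 BAP + 1 lf0 + 1 v/uv).
import numpy as np
import pyworld
import pysptk
import soundfile as sf

x, fs = sf.read("sample.wav")                       # hypothetical training audio
f0, sp, ap = pyworld.wav2world(x.astype(np.float64), fs)

mgc = pysptk.sp2mc(sp, order=59, alpha=0.58)        # 60-dim MGC per frame
bap = pyworld.code_aperiodicity(ap, fs)             # band aperiodicity (1 band at 16 kHz)
vuv = (f0 > 0).astype(np.float64)[:, None]          # 1-dim voiced/unvoiced flag
lf0 = np.where(f0 > 0, np.log(np.maximum(f0, 1e-10)), 0.0)[:, None]  # 1-dim log F0

features = np.hstack([mgc, bap[:, :1], lf0, vuv])   # n x 63; transpose for 63 x n
```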
In the embodiment of the present invention, this 63-dimensional vector is selected as the acoustic feature of the training samples because it corresponds closely to the human speech production model.
The human speech production model is shown in FIG. 3. The source excitation portion of FIG. 3 corresponds to the combined action of the airflow from the lungs and the vocal cords, and the vocal tract resonance portion corresponds to the articulatory movement of the vocal tract. Source excitation falls into two classes, voiced and unvoiced: voiced sounds are generated when the airflow strikes the vocal cords and sets them vibrating, producing quasi-periodic pulses at the glottis, so voiced sounds can be represented by a periodic pulse excitation; for unvoiced sounds the vocal cords do not vibrate and the airflow enters the vocal tract directly, so the excitation can be approximated as white noise. A person's vocal tract resonance (articulation) corresponds to a linear time-invariant system whose response to (convolution with) the input signal is the speech the person utters. In the features above, f0 corresponds to the periodic pulse excitation and BAP to the aperiodic excitation, while the vocal tract response h(n) can be derived from the MGC.
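The source-filter relation described here, s(n) = e(n) * h(n), can be illustrated with a toy simulation (a sketch of the general model only, not the patent's implementation; the pulse rate, resonance frequency, and filter are arbitrary assumptions):

```python
# Toy source-filter simulation: excitation convolved with a vocal tract filter.
import numpy as np
from scipy.signal import lfilter

fs = 16000
n = fs // 10                                  # 100 ms of samples
voiced_excitation = np.zeros(n)
voiced_excitation[::fs // 100] = 1.0          # 100 Hz periodic pulse train (voiced)
unvoiced_excitation = np.random.randn(n)      # white noise (unvoiced)

# Crude all-pole "vocal tract" with a single resonance near 500 Hz.
r, theta = 0.97, 2 * np.pi * 500 / fs
a = [1.0, -2 * r * np.cos(theta), r * r]      # denominator coefficients of h(n)
voiced_speech = lfilter([1.0], a, voiced_excitation)
unvoiced_speech = lfilter([1.0], a, unvoiced_excitation)
```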
In the embodiment of the invention, the training samples are input into the initial training model to obtain a preliminary training result, which is compared against the 63-dimensional audio features of the training samples; the model is adjusted accordingly until training is finished, yielding the speech utterance model.
Optionally, in the embodiment of the present invention, the speech utterance model may be a neural network model with a working memory mechanism. Since in Chinese the pronunciation of each character is related only to its own phonemes and not to the preceding and following characters, the embodiment adopts a working memory mechanism to reduce training time and difficulty.
As shown in FIG. 4, when the training sample is "我是中国人" ("I am Chinese"), the determined sequence of initials and finals with tones is "w, o3, sh, i4, zh, ong1, g, uo2, r, en2". The Speaker ID is the identification information of the speaker; it may be an identification code stored in the speech utterance model, such as a serial number or an identity number, serving as the unique identifier of the speaker.
This content is input into the neural network model with the working memory mechanism, and the working memory units are updated, determining which units are to be retained and which are to be discarded.
For example, in the embodiment of the present invention, the system outputs n iterations of 63-dimensional vectors, each of which is called an output unit. The whole working memory model consists of k working memory units, and the output unit at time t is determined by the current memory state Bt. The update procedure for B is shown in FIG. 5 and sketched in code after the steps below:
step 501, acquiring the text content of the current iteration based on an attention mechanism;
step 502, obtaining the Speaker ID of the speaker, the output unit of the previous iteration, and the current working memory Bt;
step 503, computing the updated working memory unit u (the bold frame in the figure) from the above four quantities;
and step 504, removing B[k] from B, shifting the remaining units one position to the right, and setting B[1] = u to obtain the new B.
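A compact sketch of this update loop follows; the patent specifies only the four inputs of step 503, so the combiner that produces u is a hypothetical stand-in:

```python
# Sketch of the FIG. 5 working-memory update (combiner network is assumed).
import numpy as np

def update_working_memory(B, attn_text, speaker_vec, last_output, combine):
    """B: list of k working-memory vectors; returns the shifted, updated B."""
    Bt = np.concatenate(B)                                # current memory state
    u = combine(attn_text, speaker_vec, last_output, Bt)  # step 503
    return [u] + B[:-1]                                   # drop B[k], shift, B[1] = u

# Hypothetical combiner: a fixed random projection of the concatenated inputs.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8 * 3 + 8 * 4))               # dims assumed: d=8, k=4
combine = lambda *parts: np.tanh(W @ np.concatenate(parts))

B = [np.zeros(8) for _ in range(4)]
B = update_working_memory(B, rng.standard_normal(8), rng.standard_normal(8),
                          rng.standard_normal(8), combine)
```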
Optionally, in the embodiment of the present invention, after the speech utterance model is determined, a speech feature vector can be obtained from the text to be uttered, the identification information of the desired speaker, and the speech utterance model; the speech feature vector is then input to a playing device, such as a WORLD vocoder, so that the text to be uttered is played in the voice of that speaker.
Optionally, in the embodiment of the present invention, to prevent the speech utterance model from overfitting, random noise following a standard normal distribution may be added when determining the speech feature vector; this increases the diversity of the model's samples so that the model does not overfit and the obtained speech feature vector is more accurate.
For example, as shown in FIG. 6, speech synthesis comprises the following steps (a combined code sketch follows the list):
step 601, converting the string of Chinese text to be generated into the corresponding phoneme sequence through the correspondence among initials, finals, tones, and the phoneme table;
step 602, generating a segment of random noise that follows a standard normal distribution;
step 603, inputting the random noise, the ID of the speaker to be imitated, and the phoneme sequence into the trained neural network module, which outputs a stream of audio acoustic features close to those of the specified speaker;
and step 604, inputting the audio acoustic features into the WORLD vocoder, which outputs a segment of audio whose content is the input string of Chinese text, pronounced in the voice of the specified speaker.
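Putting the steps together, one plausible end-to-end sketch (reusing the earlier hypothetical helpers and an assumed trained `model`; the mapping from the 63-dim features back to WORLD inputs is a simplification) is:

```python
# Sketch of the FIG. 6 pipeline: text -> phonemes -> features -> WORLD audio.
import numpy as np
import pyworld
import pysptk
import soundfile as sf

def synthesize(text_syllables, speaker_id, model, fs=16000):
    phonemes = text_to_phoneme_sequence(text_syllables)   # step 601 (earlier sketch)
    noise = np.random.default_rng().standard_normal(16)   # step 602
    feats = np.asarray(model(phonemes, speaker_id, noise), dtype=np.float64)  # step 603
    mgc, bap = feats[:, :60], feats[:, 60:61]
    lf0, vuv = feats[:, 61], feats[:, 62]
    sp = pysptk.mc2sp(mgc, alpha=0.58, fftlen=1024)
    ap = pyworld.decode_aperiodicity(np.ascontiguousarray(bap), fs, 1024)
    f0 = np.where(vuv > 0.5, np.exp(lf0), 0.0)            # voiced frames only
    audio = pyworld.synthesize(f0, sp, ap, fs)            # step 604: WORLD vocoder
    sf.write("out.wav", audio, fs)
```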
Based on the same concept, an embodiment of the present invention further provides a speech synthesis apparatus, as shown in FIG. 7, including:
a phoneme sequence determining unit 701, configured to determine the phoneme sequence corresponding to the text information to be uttered; the phoneme sequence comprises a plurality of phoneme information items, and the ordering of the phoneme information items is consistent with the ordering of the characters in the text information; each phoneme information item comprises the initial, final, and tone of its corresponding character;
and a speech feature vector determining unit 702, configured to input the phoneme sequence into a speech utterance model and determine the speech feature vector corresponding to the text information, wherein the speech utterance model is obtained by neural network training on utterance samples; the speech feature vector is used for playback through a playing device.
Further, the speech feature vector determining unit 702 is specifically configured to:
input speech parameters and the phoneme sequence into the speech utterance model, wherein the speech parameters indicate the identity of a speaker; the speech utterance model is obtained by neural network training on the utterance samples of all speakers;
and determine, through the speech utterance model, the speech feature vector of the speaker for the text information.
Further, the phoneme sequence determining unit 701 is further configured to:
acquire random noise, wherein the random noise follows a normal distribution;
and the speech feature vector determining unit 702 is specifically configured to:
input the speech parameters, the random noise, and the phoneme sequence into the speech utterance model.
Further, the speech feature vector comprises a 60-dimensional Mel-generalized cepstral (MGC) vector, a 1-dimensional band aperiodicity (BAP) feature, a 1-dimensional log fundamental frequency (lf0) feature, and a 1-dimensional voiced/unvoiced (v/uv) feature.
Further, the speech utterance model is a neural network model with a working memory mechanism.
Based on the same principle, the present invention also provides an electronic device, as shown in FIG. 8, including:
a processor 801, a memory 802, a transceiver 803, and a bus interface 804, where the processor 801, the memory 802, and the transceiver 803 are connected through the bus interface 804;
the processor 801 is configured to read the program in the memory 802 and execute the following method:
determining a phoneme sequence corresponding to the text information to be uttered; the phoneme sequence comprises a plurality of phoneme information items, and the ordering of the phoneme information items is consistent with the ordering of the characters in the text information; each phoneme information item comprises the initial, final, and tone of its corresponding character;
inputting the phoneme sequence into a speech utterance model, and determining the speech feature vector corresponding to the text information, wherein the speech utterance model is obtained by neural network training on utterance samples; the speech feature vector is used for playback through a playing device.
Further, the processor 801 is specifically configured to:
input speech parameters and the phoneme sequence into the speech utterance model, wherein the speech parameters indicate the identity of a speaker; the speech utterance model is obtained by neural network training on the utterance samples of all speakers;
and determine, through the speech utterance model, the speech feature vector of the speaker for the text information.
Further, the processor 801 acquires random noise through the transceiver 803, wherein the random noise follows a normal distribution;
the processor 801 is specifically configured to:
input the speech parameters, the random noise, and the phoneme sequence into the speech utterance model.
Further, the speech feature vector comprises a 60-dimensional Mel-generalized cepstral (MGC) vector, a 1-dimensional band aperiodicity (BAP) feature, a 1-dimensional log fundamental frequency (lf0) feature, and a 1-dimensional voiced/unvoiced (v/uv) feature.
Further, the speech utterance model is a neural network model with a working memory mechanism.
The present application also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform any one of the above speech synthesis methods.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (12)

1. A method of speech synthesis, the method comprising:
determining a phoneme sequence corresponding to the text information to be uttered; the phoneme sequence comprises a plurality of phoneme information items, and the ordering of the phoneme information items is consistent with the ordering of the characters in the text information; each phoneme information item comprises the initial, final, and tone of its corresponding character;
inputting the phoneme sequence into a speech utterance model, and determining a speech feature vector corresponding to the text information, wherein the speech utterance model is obtained by neural network training on utterance samples; the speech feature vector is used for playback through a playing device.
2. The method of claim 1, wherein said inputting the phoneme sequence into a speech utterance model and determining a speech feature vector corresponding to the text information comprises:
inputting speech parameters and the phoneme sequence into the speech utterance model, wherein the speech parameters indicate the identity of a speaker; the speech utterance model is obtained by neural network training on the utterance samples of all speakers;
and determining, through the speech utterance model, the speech feature vector of the speaker for the text information.
3. The method of claim 2, wherein before said inputting speech parameters and the phoneme sequence into the speech utterance model, the method further comprises:
acquiring random noise, wherein the random noise follows a normal distribution;
and said inputting the speech parameters and the phoneme sequence into the speech utterance model comprises:
inputting the speech parameters, the random noise, and the phoneme sequence into the speech utterance model.
4. The method of any one of claims 1 to 3, wherein the speech feature vector comprises a 60-dimensional Mel-generalized cepstral (MGC) vector, a 1-dimensional band aperiodicity (BAP) feature, a 1-dimensional log fundamental frequency (lf0) feature, and a 1-dimensional voiced/unvoiced (v/uv) feature.
5. The method of claim 4, wherein the speech utterance model is a neural network model with a working memory mechanism.
6. A speech synthesis apparatus, comprising:
a phoneme sequence determining unit, configured to determine a phoneme sequence corresponding to the text information to be uttered; the phoneme sequence comprises a plurality of phoneme information items, and the ordering of the phoneme information items is consistent with the ordering of the characters in the text information; each phoneme information item comprises the initial, final, and tone of its corresponding character;
and a speech feature vector determining unit, configured to input the phoneme sequence into a speech utterance model and determine a speech feature vector corresponding to the text information, wherein the speech utterance model is obtained by neural network training on utterance samples; the speech feature vector is used for playback through a playing device.
7. The apparatus of claim 6, wherein the speech feature vector determining unit is specifically configured to:
input speech parameters and the phoneme sequence into the speech utterance model, wherein the speech parameters indicate the identity of a speaker; the speech utterance model is obtained by neural network training on the utterance samples of all speakers;
and determine, through the speech utterance model, the speech feature vector of the speaker for the text information.
8. The apparatus of claim 7, wherein the phoneme sequence determining unit is further configured to:
acquire random noise, wherein the random noise follows a normal distribution;
and the speech feature vector determining unit is specifically configured to:
input the speech parameters, the random noise, and the phoneme sequence into the speech utterance model.
9. The apparatus of any one of claims 6 to 8, wherein the speech feature vector comprises a 60-dimensional Mel-generalized cepstral (MGC) vector, a 1-dimensional band aperiodicity (BAP) feature, a 1-dimensional log fundamental frequency (lf0) feature, and a 1-dimensional voiced/unvoiced (v/uv) feature.
10. The apparatus of claim 9, wherein the speech utterance model is a neural network model with a working memory mechanism.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.
12. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 5.
CN201810834892.3A 2018-07-26 2018-07-26 Speech synthesis method and device Pending CN109036377A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810834892.3A CN109036377A (en) 2018-07-26 2018-07-26 Speech synthesis method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810834892.3A CN109036377A (en) 2018-07-26 2018-07-26 Speech synthesis method and device

Publications (1)

Publication Number Publication Date
CN109036377A (en) 2018-12-18

Family

ID=64645616

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810834892.3A Pending CN109036377A (en) 2018-07-26 2018-07-26 Speech synthesis method and device

Country Status (1)

Country Link
CN (1) CN109036377A (en)



Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102270449A (en) * 2011-08-10 2011-12-07 歌尔声学股份有限公司 Method and system for synthesising parameter speech
CN103236259A (en) * 2013-03-22 2013-08-07 乐金电子研发中心(上海)有限公司 Voice recognition processing and feedback system, voice response method
CN106205601A (en) * 2015-05-06 2016-12-07 科大讯飞股份有限公司 Determine the method and system of text voice unit
CN105956588A (en) * 2016-04-21 2016-09-21 深圳前海勇艺达机器人有限公司 Method of intelligent scanning and text reading and robot device
JP2017107228A (en) * 2017-02-20 2017-06-15 株式会社テクノスピーチ Singing voice synthesis device and singing voice synthesis method
CN107452408A (en) * 2017-07-27 2017-12-08 上海与德科技有限公司 A kind of audio frequency playing method and device
CN107705783A (en) * 2017-11-27 2018-02-16 北京搜狗科技发展有限公司 A kind of phoneme synthesizing method and device
CN107945786A (en) * 2017-11-27 2018-04-20 北京百度网讯科技有限公司 Phoneme synthesizing method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
凌震华: "Research on Speech Synthesis Technology Based on Statistical Acoustic Modeling" (《基于统计声学建模的语音合成技术研究》), 《信息科技辑》 *
崔宣: "Research on Speaker Recognition Based on Mixed Speech Features" (《基于语音混合特征说话人识别的研究》), 《信息科技辑》 *

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109785823A (en) * 2019-01-22 2019-05-21 中财颐和科技发展(北京)有限公司 Phoneme synthesizing method and system
CN110070852A (en) * 2019-04-26 2019-07-30 平安科技(深圳)有限公司 Synthesize method, apparatus, equipment and the storage medium of Chinese speech
CN110070852B (en) * 2019-04-26 2023-06-16 平安科技(深圳)有限公司 Method, device, equipment and storage medium for synthesizing Chinese voice
WO2020215551A1 (en) * 2019-04-26 2020-10-29 平安科技(深圳)有限公司 Chinese speech synthesizing method, apparatus and device, storage medium
CN110335608A (en) * 2019-06-17 2019-10-15 平安科技(深圳)有限公司 Voice print verification method, apparatus, equipment and storage medium
CN110335608B (en) * 2019-06-17 2023-11-28 平安科技(深圳)有限公司 Voiceprint verification method, voiceprint verification device, voiceprint verification equipment and storage medium
CN112287112A (en) * 2019-07-25 2021-01-29 北京中关村科金技术有限公司 Method, device and storage medium for constructing special pronunciation dictionary
US11417314B2 (en) 2019-09-19 2022-08-16 Baidu Online Network Technology (Beijing) Co., Ltd. Speech synthesis method, speech synthesis device, and electronic apparatus
CN110619867A (en) * 2019-09-27 2019-12-27 百度在线网络技术(北京)有限公司 Training method and device of speech synthesis model, electronic equipment and storage medium
US11488577B2 (en) 2019-09-27 2022-11-01 Baidu Online Network Technology (Beijing) Co., Ltd. Training method and apparatus for a speech synthesis model, and storage medium
CN110619867B (en) * 2019-09-27 2020-11-03 百度在线网络技术(北京)有限公司 Training method and device of speech synthesis model, electronic equipment and storage medium
CN110880327A (en) * 2019-10-29 2020-03-13 平安科技(深圳)有限公司 Audio signal processing method and device
WO2021127821A1 (en) * 2019-12-23 2021-07-01 深圳市优必选科技股份有限公司 Speech synthesis model training method, apparatus, computer device, and storage medium
CN110956948A (en) * 2020-01-03 2020-04-03 北京海天瑞声科技股份有限公司 End-to-end speech synthesis method, device and storage medium
CN113314096A (en) * 2020-02-25 2021-08-27 阿里巴巴集团控股有限公司 Speech synthesis method, apparatus, device and storage medium
CN111785248B (en) * 2020-03-12 2023-06-23 北京汇钧科技有限公司 Text information processing method and device
CN111785248A (en) * 2020-03-12 2020-10-16 北京京东尚科信息技术有限公司 Text information processing method and device
CN111653265A (en) * 2020-04-26 2020-09-11 北京大米科技有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN111653265B (en) * 2020-04-26 2023-08-18 北京大米科技有限公司 Speech synthesis method, device, storage medium and electronic equipment
CN111583900B (en) * 2020-04-27 2022-01-07 北京字节跳动网络技术有限公司 Song synthesis method and device, readable medium and electronic equipment
CN111583900A (en) * 2020-04-27 2020-08-25 北京字节跳动网络技术有限公司 Song synthesis method and device, readable medium and electronic equipment
CN112151005B (en) * 2020-09-28 2022-08-19 四川长虹电器股份有限公司 Chinese and English mixed speech synthesis method and device
CN112151005A (en) * 2020-09-28 2020-12-29 四川长虹电器股份有限公司 Chinese and English mixed speech synthesis method and device
WO2022141710A1 (en) * 2020-12-28 2022-07-07 科大讯飞股份有限公司 Speech synthesis method, apparatus and device, and storage medium
CN113053355A (en) * 2021-03-17 2021-06-29 平安科技(深圳)有限公司 Fole human voice synthesis method, device, equipment and storage medium
CN113421547A (en) * 2021-06-03 2021-09-21 华为技术有限公司 Voice processing method and related equipment
CN114974208A (en) * 2022-06-20 2022-08-30 青岛大学 Chinese speech synthesis method and device, electronic equipment and storage medium
CN114974208B (en) * 2022-06-20 2024-05-31 青岛大学 Chinese speech synthesis method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109036377A (en) Speech synthesis method and device
JP7500020B2 (en) Multilingual text-to-speech synthesis method
Gold et al. Speech and audio signal processing: processing and perception of speech and music
Kawai et al. XIMERA: A new TTS from ATR based on corpus-based technologies
CN108899009B (en) Chinese speech synthesis system based on phoneme
CN109313891B (en) System and method for speech synthesis
US5930755A (en) Utilization of a recorded sound sample as a voice source in a speech synthesizer
Govind et al. Expressive speech synthesis: a review
US20060259303A1 (en) Systems and methods for pitch smoothing for text-to-speech synthesis
US8775185B2 (en) Speech samples library for text-to-speech and methods and apparatus for generating and using same
Qian et al. A cross-language state sharing and mapping approach to bilingual (Mandarin–English) TTS
US20130262120A1 (en) Speech synthesis device and speech synthesis method
WO2023279976A1 (en) Speech synthesis method, apparatus, device, and storage medium
Kardava et al. Solving the problem of the accents for speech recognition systems
CN114255738A (en) Speech synthesis method, apparatus, medium, and electronic device
Ipsic et al. Croatian HMM-based speech synthesis
RU61924U1 (en) STATISTICAL SPEECH MODEL
Nthite et al. End-to-End Text-To-Speech synthesis for under resourced South African languages
Takaki et al. Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2012
Maia et al. An HMM-based Brazilian Portuguese speech synthesizer and its characteristics
Hinterleitner et al. Speech synthesis
Georgila 19 Speech Synthesis: State of the Art and Challenges for the Future
Karjalainen Review of speech synthesis technology
KR100608643B1 (en) Pitch modelling apparatus and method for voice synthesizing system
Näslund Simulating hypernasality with phonological features in Swedish TTS

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181218