CN109036377A - Speech synthesis method and device - Google Patents
Speech synthesis method and device Download PDF Info
- Publication number
- CN109036377A CN109036377A CN201810834892.3A CN201810834892A CN109036377A CN 109036377 A CN109036377 A CN 109036377A CN 201810834892 A CN201810834892 A CN 201810834892A CN 109036377 A CN109036377 A CN 109036377A
- Authority
- CN
- China
- Prior art keywords
- speech
- phoneme
- feature vector
- voice
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
The present invention provides a speech synthesis method and device, relating to the field of computer technology. The method comprises: determining a phoneme sequence corresponding to text information to be uttered, where the phoneme sequence includes multiple pieces of phoneme information whose ordering is consistent with the ordering of the characters in the text information, and each piece of phoneme information includes the initial, final and tone of its corresponding character; and inputting the phoneme sequence into a speech utterance model to determine a speech feature vector corresponding to the text information, where the speech utterance model is obtained by neural network training on utterance samples. The speech feature vector is then played by a playback device. Because the relationship among the initials, finals and tones of Chinese pronunciation is taken into account, the simulated sound has higher authenticity, and the method is applicable to the various dialects and other languages composed of phonemes, giving it very high scalability.
Description
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for speech synthesis.
Background
Speech synthesis is a technique for generating artificial speech by mechanical or electronic means. TTS (text-to-speech) technology is a branch of speech synthesis; it converts text information, whether generated by a computer or input from outside, into intelligible, fluent spoken Chinese and outputs it.
Existing speech synthesis techniques fall into three categories. The first is rule-based synthesis, which generates the target speech from phonetic rules: the system stores acoustic parameters of small speech units (e.g., phonemes, diphones, demi-syllables or syllables), together with rules for composing syllables from phonemes and words or sentences from syllables. The second is waveform-concatenation synthesis, which uses sentences, phrases, words or syllables as synthesis units: each unit is recorded, digitally coded and compressed to form a synthetic speech library; at playback time, the waveform data of the required units are retrieved from the corpus, concatenated or edited together, and decoded back into speech. The third is parametric analysis-synthesis, which mostly uses syllables, demi-syllables or phones as synthesis units.
However, the first method requires very complicated rules, with different rules set for different environments and contexts, and the synthesized speech has low naturalness, so it cannot be widely applied. The second and third methods can synthesize high-quality speech only from a sufficient quantity of high-quality speaker recordings, the speech library files they use are too large, and they cannot handle polyphonic characters at all; nor can they synthesize from sound recorded in public places.
In summary, the prior art cannot provide a speech synthesis method with simple rules and high speech naturalness.
Disclosure of Invention
The invention provides a speech synthesis method and device, which are used for solving the problem that the prior art cannot provide a speech synthesis method with simple rules and high speech naturalness.
The embodiment of the invention provides a voice synthesis method, which comprises the following steps: determining a phoneme sequence corresponding to text information to be sounded; the phoneme sequence comprises a plurality of phoneme information, and the sequencing of each phoneme information is consistent with the sequencing of each character in the text information; the phoneme information comprises initials, finals and tones of characters corresponding to the phoneme information;
inputting the phoneme sequence into a voice sounding model, and determining a voice feature vector corresponding to the text information, wherein the voice sounding model is obtained by carrying out neural network training on a sounding sample; the voice feature vector is used for playing through a playing device.
In the embodiment of the invention, the text information to be uttered is converted according to the initials, finals and tones of its characters. Because the relationship among the initials, finals and tones of Chinese pronunciation is taken into account, the simulated sound has higher authenticity, and the method is applicable to the various dialects and other languages composed of phonemes, giving it very high scalability.
Further, the inputting the phoneme sequence into a speech sound generation model and determining a speech feature vector corresponding to the text information includes:
inputting speech parameters and the phoneme sequence into the speech sound generation model, wherein the speech parameters are used for indicating the identification of a speaker; the voice sound production model is obtained by carrying out neural network training on sound production samples of all speakers;
and determining the voice feature vector of the speaker corresponding to the text information through the voice production model.
In the embodiment of the invention, speech parameters are input into the speech utterance model, and the resulting speech feature vector corresponds to the voice of the speaker identified by those parameters, so that utterance is carried out in that speaker's voice. Any speaker whose voice was used to train the speech utterance model can thus be used for playback.
Further, before inputting the speech parameters and the phoneme sequence into the speech sound generation model, the method further includes:
acquiring random noise, wherein the random noise conforms to normal distribution;
the inputting of the speech parameters and the phoneme sequence into the speech utterance model includes:
inputting the speech parameters, the random noise and the phoneme sequence into the speech utterance model.
In the embodiment of the invention, to prevent the speech utterance model from over-fitting, random noise that conforms to a normal distribution is added, making the output speech feature vector more accurate.
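The three model inputs described above (noise, speaker identification, phoneme sequence) can be sketched as follows; the 16-dimensional noise size and the dictionary packaging are illustrative assumptions, since the patent does not specify input shapes.

```python
import random

# Draw random noise from a standard normal distribution, as described above.
# The 16-dim size is an illustrative assumption, not given in the patent.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(16)]

speaker_id = 7                    # identification of the speaker (assumption)
phoneme_seq = [10, 25, 7, 31]     # phoneme sequence for the text

# The speech utterance model receives all three inputs together:
model_inputs = {"noise": noise, "speaker": speaker_id, "phonemes": phoneme_seq}
print(len(model_inputs["noise"]))  # 16
```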
Further, the speech feature vector comprises a 60-dimensional Mel-generalized cepstral vector, a 1-dimensional band aperiodicity feature vector, a 1-dimensional log fundamental frequency feature vector and a 1-dimensional unvoiced/voiced feature vector.
In the embodiment of the invention, the selected 63-dimensional vector closely models both Chinese pronunciation and individual voices, so the speech feature vector conforms well to Chinese pronunciation rules and yields a better simulation of human speech.
Further, the speech sound production model is a neural network model of a working memory mechanism.
In the embodiment of the invention, because the phoneme of each character in Chinese pronunciation has no dependency on the preceding characters, a neural network model with a working memory mechanism is used, which reduces the computation required by the neural network and simplifies the data volume of the model.
An embodiment of the present invention further provides a speech synthesis apparatus, including:
the phoneme sequence determining unit is used for determining a phoneme sequence corresponding to the text information to be sounded; the phoneme sequence comprises a plurality of phoneme information, and the sequencing of each phoneme information is consistent with the sequencing of each character in the text information; the phoneme information comprises initials, finals and tones of characters corresponding to the phoneme information;
the voice feature vector determining unit is used for inputting the phoneme sequence into a voice production model and determining a voice feature vector corresponding to the text information, wherein the voice production model is obtained by carrying out neural network training on a sound production sample; the voice feature vector is used for playing through a playing device.
In the embodiment of the invention, the text information to be uttered is converted according to the initials, finals and tones of its characters. Because the relationship among the initials, finals and tones of Chinese pronunciation is taken into account, the simulated sound has higher authenticity, and the method is applicable to the various dialects and other languages composed of phonemes, giving it very high scalability.
Further, the speech feature vector determination unit is specifically configured to:
inputting speech parameters and the phoneme sequence into the speech sound generation model, wherein the speech parameters are used for indicating the identification of a speaker; the voice sound production model is obtained by carrying out neural network training on sound production samples of all speakers;
and determining the voice feature vector of the speaker corresponding to the text information through the voice production model.
Further, the phoneme sequence determination unit is further configured to:
acquiring random noise, wherein the random noise conforms to normal distribution;
the speech feature vector determination unit is specifically configured to:
inputting the speech parameters, the random noise and the phoneme sequence into the speech utterance model.
Further, the speech feature vector comprises a 60-dimensional Mel-generalized cepstral vector, a 1-dimensional band aperiodicity feature vector, a 1-dimensional log fundamental frequency feature vector and a 1-dimensional unvoiced/voiced feature vector.
Further, the speech sound production model is a neural network model of a working memory mechanism.
An embodiment of the present invention further provides an electronic device, including:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any of the above embodiments.
Embodiments of the present invention also provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of the above embodiments.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.
Fig. 1 is a schematic flow chart of a speech synthesis method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of training a speech utterance model according to an embodiment of the present invention;
FIG. 3 is a diagram of a human speech model according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating a neural network training process according to an embodiment of the present invention;
fig. 5 is a schematic flow chart of a neural network training process according to an embodiment of the present invention;
FIG. 6 is a schematic flow chart of speech synthesis according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The present invention provides a speech synthesis method, as shown in fig. 1, including:
step 101, determining a phoneme sequence corresponding to text information to be sounded; the phoneme sequence comprises a plurality of phoneme information, and the sequencing of each phoneme information is consistent with the sequencing of each character in the text information; the phoneme information comprises initials, finals and tones of characters corresponding to the phoneme information;
step 102, inputting the phoneme sequence into a voice production model, and determining a voice feature vector corresponding to the text information, wherein the voice production model is obtained by performing neural network training on a sound production sample; the voice feature vector is used for playing through a playing device.
In step 101, the text information to be uttered is the text information that needs speech synthesis. In the embodiment of the invention, the initial, final and tone of each character in the text information correspond one-to-one with entries in the phoneme sequence; the text information is converted into the phoneme sequence according to this correspondence, and the ordering within the phoneme sequence is consistent with the ordering of the characters.
For example, in the embodiment of the present invention, each Chinese character may be split into an initial, a final and a tone; the initial corresponds to one identifier in the phoneme comparison table, and the final combined with the tone corresponds to another, and these identifiers are arranged in order to form the phoneme sequence.
For example, take "你好" ("hello") as the text information to be uttered. The pinyin of the first character is "ni3", where "n" is the initial, "i" is the final and 3 represents the tone; to better match Chinese pronunciation, the final and tone are combined into "i3". Optionally, in the embodiment of the present invention, a phoneme comparison table exists, for example, as shown in table 1:
table 1: phoneme comparison table
Through the phoneme comparison table, the initials, finals and tones of the text information to be uttered can be converted into a phoneme sequence whose arrangement matches the order of the initials and finals of each Chinese character and the order of the characters themselves.
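The conversion described above can be sketched as follows. The table entries and function name are illustrative assumptions, since the patent does not publish its full phoneme comparison table.

```python
# Sketch of the text-to-phoneme-sequence step; the identifier values are
# made up (the patent's actual phoneme comparison table is not given).
PHONEME_TABLE = {
    "n": 10,    # initial "n"
    "i3": 25,   # final "i" combined with tone 3
    "h": 7,     # initial "h"
    "ao3": 31,  # final "ao" combined with tone 3
}

def to_phoneme_sequence(pinyin_syllables):
    """Convert syllables like 'ni3' into [initial_id, final_tone_id, ...],
    preserving character order as the patent requires."""
    seq = []
    for syl in pinyin_syllables:
        tone = syl[-1]                       # tone digit, e.g. "3"
        body = syl[:-1]                      # "ni"
        # Split off the initial: the leading consonant cluster (handles
        # digraphs like "zh" and "sh" as well).
        initial = ""
        while body and body[0] not in "aeiouv":
            initial += body[0]
            body = body[1:]
        if initial:
            seq.append(PHONEME_TABLE[initial])
        seq.append(PHONEME_TABLE[body + tone])  # final merged with tone
    return seq

# "你好" -> pinyin ["ni3", "hao3"]
print(to_phoneme_sequence(["ni3", "hao3"]))  # -> [10, 25, 7, 31]
```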
In step 102, after the phoneme sequence is determined, the phoneme sequence is input into a speech utterance model, and an output result is determined. In the embodiment of the invention, the voice sound generation model is obtained by training a neural network according to the sound generation samples.
Optionally, in the embodiment of the present invention, the speech utterance model is trained on utterance samples, which may be any available audio, including audio from the internet, recorded speeches, video soundtracks, and the like.
Optionally, in the embodiment of the present invention, the playback device may be a WORLD vocoder, which pronounces according to the determined speech feature vector; the content of the pronunciation is the text information to be uttered.
Optionally, in step 102, the inputting the phoneme sequence into a speech sound generation model, and determining a speech feature vector corresponding to the text information includes:
inputting speech parameters and the phoneme sequence into the speech sound generation model, wherein the speech parameters are used for indicating the identification of a speaker; the voice sound production model is obtained by carrying out neural network training on training sound production samples of all speakers;
and determining the voice feature vector of the speaker corresponding to the text information through the voice production model.
That is to say, besides the phoneme sequence, the obtained speech parameters are also input into the speech utterance model. In the embodiment of the present invention, the model is trained not only on the utterance samples but also on the identity information of the speaker corresponding to each sample, so the resulting model can not only pronounce accurately but also utter in the voice of any speaker it was trained on.
Optionally, in the embodiment of the present invention, the input speech parameter is the identification information of a speaker, and the training samples used when training the utterance model contain the voice of the speaker corresponding to that identification information.
Optionally, in the embodiment of the present invention, as shown in fig. 2, the speech sound generation model may be trained by the method in fig. 2.
Optionally, in the embodiment of the present invention, the training samples are all audio: audio tracks from videos, recorded speeches, audio programs and other audio content may all serve as training samples.
After the audio sample is obtained, the voice information of the speaker needs to be obtained for model training, so that the trained voice sound generation model can learn the pronunciation method of each speaker.
Optionally, in the embodiment of the present invention, the training samples are divided into training data and test data, and whether the trained model achieves the expected target is determined according to the audio features of the test data.
In general, when processing speech signals, the time-domain waveform of the sound is not used directly; the audio requires some processing first. The training sample is preprocessed, and the acoustic features of its audio part are extracted as a 63 x n vector, where n is proportional to the length of the audio. The 63 dimensions mainly comprise 60-dimensional Mel-generalized cepstral (MGC) parameters: the MGC features are Mel-frequency cepstral coefficient (MFCC) features after dimensionality reduction. Because the dimensionality of the MFCC features is too high for direct training, dimensionality reduction is applied to them, yielding the MGC features.
The remaining dimensions are: a 1-dimensional band aperiodicity (BAP) feature, the aperiodic excitation signal; a 1-dimensional log fundamental frequency (lf0), the fundamental frequency of the periodic pulse train that characterizes an individual's voicing; and a 1-dimensional unvoiced/voiced flag (v/uv) that distinguishes unvoiced from voiced sounds.
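The per-frame layout of the 63 x n feature matrix described above can be sketched as follows; the placeholder values are assumptions, and only the dimension layout (60 MGC + 1 BAP + 1 lf0 + 1 v/uv) comes from the patent.

```python
import numpy as np

def assemble_features(mgc, bap, lf0, vuv):
    """Stack per-frame features into the 63 x n matrix described above:
    60-dim MGC, 1-dim BAP, 1-dim log-F0, 1-dim voiced/unvoiced flag."""
    assert mgc.shape[0] == 60
    return np.vstack([mgc, bap, lf0, vuv])   # shape (63, n)

n_frames = 200                               # n is proportional to audio length
mgc = np.zeros((60, n_frames))               # placeholder MGC parameters
bap = np.zeros((1, n_frames))                # band aperiodicity
lf0 = np.full((1, n_frames), 4.7)            # log fundamental frequency
vuv = np.ones((1, n_frames))                 # 1 = voiced, 0 = unvoiced

features = assemble_features(mgc, bap, lf0, vuv)
print(features.shape)  # (63, 200)
```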
In the embodiment of the present invention, the 63-dimensional vector is selected as the acoustic feature of the training sample because the 63-dimensional vector is similar to the human voice model.
The human voice model is shown in fig. 3: the source excitation portion corresponds to the excitation produced by the combined action of airflow from the lungs and the vocal cords, and the vocal tract resonance portion corresponds to the articulatory movement of the vocal tract. Source excitation falls into voiced and unvoiced classes. Voiced sound is generated when the airflow strikes the vocal cords and sets them vibrating, producing quasi-periodic pulses at the glottis, so voiced sound can be represented by periodic pulse excitation. For unvoiced sound the vocal cords do not vibrate and the airflow enters the vocal tract directly, so the excitation reduces to white noise. A person's vocal tract resonance corresponds to a linear time-invariant system whose response to (convolution with) the input signal is the speech the person utters. In the features above, lf0 corresponds to the periodic pulse excitation and BAP to the aperiodic excitation, while the vocal tract response h(n) can be derived from the MGC.
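The source-filter description above can be illustrated numerically: a periodic pulse train (voiced) or white noise (unvoiced) convolved with a vocal-tract impulse response h(n). The impulse response here is a toy assumption, not derived from any MGC.

```python
import numpy as np

fs = 16000                  # sample rate in Hz (assumption)
f0 = 100                    # fundamental frequency of the voiced excitation
n = 1600                    # 0.1 s of signal

# Voiced source: quasi-periodic pulse train at the glottis.
voiced_src = np.zeros(n)
voiced_src[:: fs // f0] = 1.0

# Unvoiced source: white noise (the vocal cords do not vibrate).
rng = np.random.default_rng(0)
unvoiced_src = rng.standard_normal(n)

# Vocal tract as a linear time-invariant system: speech = source * h(n).
h = np.exp(-np.arange(64) / 8.0)              # toy impulse response (assumption)
voiced_speech = np.convolve(voiced_src, h)[:n]
unvoiced_speech = np.convolve(unvoiced_src, h)[:n]
```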
In the embodiment of the invention, the training sample is input into the primary training model to obtain a primary training result, and the primary training result is compared with the audio features of the 63-dimensional training sample, and the primary training model is adjusted until the training is finished to obtain the voice production model.
Optionally, in the embodiment of the present invention, the speech utterance model may be a neural network model with a working memory mechanism: since in Chinese pronunciation each character's sound depends only on its own phonemes and not on the preceding or following characters, adopting such a model reduces training duration and difficulty.
As shown in fig. 4, when the training sample is "我是中国人" ("I am Chinese"), the determined sequence of initials, finals and tones is "w, o3, sh, i4, zh, ong1, g, uo2, r, en2". The Speaker ID is the identification information of the speaker, which may be an identification code stored in the speech utterance model, such as a serial number or the speaker's identity number, serving as the speaker's unique identification.
Inputting the content into a neural network model of a working memory mechanism, updating working memory units, and determining which working memory units need to be memorized and which working memory units need to be deleted.
For example, in the embodiment of the present invention, the system outputs n iterations of 63-dimensional vectors, each called an output unit. The whole working memory consists of k working memory units, and the output unit at time t is determined by the current memory state Bt. The update method of B is shown in fig. 5:
step 501, acquiring the text content of the current iteration based on an attention mechanism;
step 502, obtaining Speaker ID of a Speaker, a last iteration output unit and current working memory Bt;
step 503, calculating the updated working memory unit u (the bold frame in the figure) according to the above four quantities;
and step 504, removing B [ k ] in B, translating the rest units to the right, and making B [1] be u to obtain new B.
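Steps 503-504 above amount to a shift-register update of the k working memory units. A minimal sketch, with the computation of u abstracted into a placeholder since the patent does not publish the exact formula:

```python
def update_working_memory(B, u):
    """Step 504: drop B[k], shift the remaining units right,
    and place the new unit u at B[1] (1-indexed in the patent)."""
    return [u] + B[:-1]

def compute_unit(attended_text, speaker_id, prev_output, B):
    """Step 503 placeholder: the patent derives u from these four
    quantities but does not give the formula (this stand-in just
    pairs the attended text with the speaker)."""
    return (attended_text, speaker_id)

k = 4
B = ["b1", "b2", "b3", "b4"]          # the k working memory units
u = compute_unit("wo3", "speaker-7", None, B)   # steps 501-503
B = update_working_memory(B, u)                 # step 504
print(B)  # [('wo3', 'speaker-7'), 'b1', 'b2', 'b3']
```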
Optionally, in the embodiment of the present invention, after the voice utterance model is determined, a voice feature vector may be obtained according to the text to be uttered, the identification information of the utterer who needs to utter, and the voice utterance model, and the voice feature vector is input to a playing device, such as a world vocoder, so that the text to be uttered may be played by using the voice of the utterer.
Optionally, in the embodiment of the present invention, to prevent the speech utterance model from over-fitting, random noise following a standard normal distribution may be added when determining the speech feature vector; this increases the diversity of the model's samples so that the model does not over-fit and the obtained speech feature vector is more accurate.
For example, as shown in fig. 6, performing speech synthesis includes several steps, respectively:
step 601, converting the Chinese text to be generated into the corresponding phoneme sequence through the correspondence between initials, finals, tones and the phoneme table;
step 602, generating a section of random noise which follows standard normal distribution;
step 603, inputting the random noise, the ID of the speaker to be simulated and the phoneme sequence into the trained neural network module, which outputs a string of audio acoustic features close to the specified speaker;
step 604, inputting the audio acoustic features into the WORLD vocoder, which outputs a piece of audio whose content is the input Chinese text, spoken with the pronunciation of the specified speaker.
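Steps 601-604 can be sketched end to end as follows. Every function here is a hypothetical stand-in (the real phoneme table, trained network and WORLD vocoder are not reproduced); only the data flow and the 63-dimensional feature shape come from the patent.

```python
import random

def text_to_phonemes(text):
    """Step 601 stand-in: text -> phoneme ids via the phoneme table
    (the real table is not published; this mapping is a placeholder)."""
    return [ord(c) % 50 for c in text]

def trained_network(noise, speaker_id, phonemes):
    """Step 603 stand-in for the trained neural network: one 63-dim
    acoustic feature vector per frame (here, 10 frames per phoneme)."""
    n_frames = 10 * len(phonemes)
    return [[0.0] * 63 for _ in range(n_frames)]

def world_vocoder(features):
    """Step 604 stand-in: the WORLD vocoder renders features to audio
    samples (80 samples per frame is an assumption)."""
    return [0.0] * (len(features) * 80)

text = "中国人"                                         # Chinese text to speak
noise = [random.gauss(0.0, 1.0) for _ in range(16)]     # step 602: N(0, 1)
features = trained_network(noise, speaker_id=7,
                           phonemes=text_to_phonemes(text))
audio = world_vocoder(features)
print(len(features), len(audio))  # 30 2400
```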
Based on the same concept, an embodiment of the present invention further provides a speech synthesis apparatus, as shown in fig. 7, including:
a phoneme sequence determining unit 701, configured to determine a phoneme sequence corresponding to text information to be uttered; the phoneme sequence comprises a plurality of phoneme information, and the sequencing of each phoneme information is consistent with the sequencing of each character in the text information; the phoneme information comprises initials, finals and tones of characters corresponding to the phoneme information;
a speech feature vector determining unit 702, configured to input the phoneme sequence into a speech model, and determine a speech feature vector corresponding to the text information, where the speech model is obtained by performing neural network training on a speech sample; the voice feature vector is used for playing through a playing device.
Further, the speech feature vector determining unit 702 is specifically configured to:
inputting speech parameters and the phoneme sequence into the speech sound generation model, wherein the speech parameters are used for indicating the identification of a speaker; the voice sound production model is obtained by carrying out neural network training on training sound production samples of all speakers;
and determining the voice feature vector of the speaker corresponding to the text information through the voice production model.
Further, the phoneme sequence determining unit 701 is further configured to:
acquiring random noise, wherein the random noise conforms to normal distribution;
the speech feature vector determination unit is specifically configured to:
inputting the speech parameters, the random noise and the phoneme sequence into the speech utterance model.
Further, the speech feature vector comprises a 60-dimensional Mel-generalized cepstral vector, a 1-dimensional band aperiodicity feature vector, a 1-dimensional log fundamental frequency feature vector and a 1-dimensional unvoiced/voiced feature vector.
Further, the speech production model is a neural network model with a working memory mechanism.
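A working memory mechanism generally means the network carries a state across time steps, so each output frame can depend on context beyond the current phoneme. The following is a generic recurrent sketch of that idea, not the patent's specific architecture; all weights and dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)

class WorkingMemoryCell:
    """Illustrative recurrent cell: a hidden 'working memory' vector is
    carried across steps and mixed with each new input frame."""
    def __init__(self, in_dim: int, mem_dim: int):
        self.W_in = rng.standard_normal((mem_dim, in_dim)) * 0.1
        self.W_mem = rng.standard_normal((mem_dim, mem_dim)) * 0.1

    def step(self, x: np.ndarray, memory: np.ndarray) -> np.ndarray:
        # The updated memory depends on both the current input and the
        # previous memory, giving the model temporal context.
        return np.tanh(self.W_in @ x + self.W_mem @ memory)

cell = WorkingMemoryCell(in_dim=16, mem_dim=32)
memory = np.zeros(32)
for x in rng.standard_normal((5, 16)):  # 5 input frames
    memory = cell.step(x, memory)
```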
Based on the same principle, the present invention further provides an electronic device, as shown in fig. 8, including:
a processor 801, a memory 802, a transceiver 803, and a bus interface 804, wherein the processor 801, the memory 802, and the transceiver 803 are connected through the bus interface 804;
the processor 801 is configured to read a program in the memory 802 and execute the following method:
determining a phoneme sequence corresponding to text information to be uttered, wherein the phoneme sequence comprises a plurality of pieces of phoneme information, the ordering of the phoneme information is consistent with the ordering of the characters in the text information, and each piece of phoneme information comprises the initial, the final, and the tone of the character to which it corresponds;
inputting the phoneme sequence into a speech production model, and determining a speech feature vector corresponding to the text information, wherein the speech production model is obtained by training a neural network on utterance samples, and the speech feature vector is used for playback through a playback device.
Further, the processor 801 is specifically configured to:
input speech parameters and the phoneme sequence into the speech production model, wherein the speech parameters indicate an identifier of a speaker, and the speech production model is obtained by training a neural network on utterance samples of each speaker; and
determine, through the speech production model, the speech feature vector of that speaker corresponding to the text information.
Further, the processor 801 acquires random noise through the transceiver 803, wherein the random noise follows a normal distribution;
the processor 801 is specifically configured to:
input the speech parameters, the random noise, and the phoneme sequence into the speech production model.
Further, the speech feature vector comprises a 60-dimensional mel-generalized cepstrum vector, a 1-dimensional band aperiodicity feature, a 1-dimensional logarithmic fundamental frequency (log F0) feature, and a 1-dimensional voiced/unvoiced feature.
Further, the speech production model is a neural network model with a working memory mechanism.
The present application provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform any one of the above speech synthesis methods.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.
Claims (12)
1. A speech synthesis method, comprising:
determining a phoneme sequence corresponding to text information to be uttered, wherein the phoneme sequence comprises a plurality of pieces of phoneme information, the ordering of the phoneme information is consistent with the ordering of the characters in the text information, and each piece of phoneme information comprises the initial, the final, and the tone of the character to which it corresponds; and
inputting the phoneme sequence into a speech production model, and determining a speech feature vector corresponding to the text information, wherein the speech production model is obtained by training a neural network on utterance samples, and the speech feature vector is used for playback through a playback device.
2. The method of claim 1, wherein the inputting the phoneme sequence into a speech production model and determining a speech feature vector corresponding to the text information comprises:
inputting speech parameters and the phoneme sequence into the speech production model, wherein the speech parameters indicate an identifier of a speaker, and the speech production model is obtained by training a neural network on utterance samples of each speaker; and
determining, through the speech production model, the speech feature vector of that speaker corresponding to the text information.
3. The method of claim 2, wherein before the inputting the speech parameters and the phoneme sequence into the speech production model, the method further comprises:
acquiring random noise, wherein the random noise follows a normal distribution; and
the inputting the speech parameters and the phoneme sequence into the speech production model comprises:
inputting the speech parameters, the random noise, and the phoneme sequence into the speech production model.
4. The method of any one of claims 1 to 3, wherein the speech feature vector comprises a 60-dimensional mel-generalized cepstrum vector, a 1-dimensional band aperiodicity feature, a 1-dimensional logarithmic fundamental frequency (log F0) feature, and a 1-dimensional voiced/unvoiced feature.
5. The method of claim 4, wherein the speech production model is a neural network model with a working memory mechanism.
6. A speech synthesis apparatus, comprising:
a phoneme sequence determining unit, configured to determine a phoneme sequence corresponding to text information to be uttered, wherein the phoneme sequence comprises a plurality of pieces of phoneme information, the ordering of the phoneme information is consistent with the ordering of the characters in the text information, and each piece of phoneme information comprises the initial, the final, and the tone of the character to which it corresponds; and
a speech feature vector determining unit, configured to input the phoneme sequence into a speech production model and determine a speech feature vector corresponding to the text information, wherein the speech production model is obtained by training a neural network on utterance samples, and the speech feature vector is used for playback through a playback device.
7. The apparatus of claim 6, wherein the speech feature vector determining unit is specifically configured to:
input speech parameters and the phoneme sequence into the speech production model, wherein the speech parameters indicate an identifier of a speaker, and the speech production model is obtained by training a neural network on utterance samples of each speaker; and
determine, through the speech production model, the speech feature vector of that speaker corresponding to the text information.
8. The apparatus of claim 7, wherein the phoneme sequence determining unit is further configured to:
acquire random noise, wherein the random noise follows a normal distribution; and
the speech feature vector determining unit is specifically configured to:
input the speech parameters, the random noise, and the phoneme sequence into the speech production model.
9. The apparatus of any one of claims 6 to 8, wherein the speech feature vector comprises a 60-dimensional mel-generalized cepstrum vector, a 1-dimensional band aperiodicity feature, a 1-dimensional logarithmic fundamental frequency (log F0) feature, and a 1-dimensional voiced/unvoiced feature.
10. The apparatus of claim 9, wherein the speech production model is a neural network model with a working memory mechanism.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor, wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 5.
12. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810834892.3A CN109036377A (en) | 2018-07-26 | 2018-07-26 | A kind of phoneme synthesizing method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109036377A true CN109036377A (en) | 2018-12-18 |
Family
ID=64645616
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810834892.3A Pending CN109036377A (en) | 2018-07-26 | 2018-07-26 | A kind of phoneme synthesizing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109036377A (en) |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102270449A (en) * | 2011-08-10 | 2011-12-07 | 歌尔声学股份有限公司 | Method and system for synthesising parameter speech |
CN103236259A (en) * | 2013-03-22 | 2013-08-07 | 乐金电子研发中心(上海)有限公司 | Voice recognition processing and feedback system, voice response method |
CN106205601A (en) * | 2015-05-06 | 2016-12-07 | 科大讯飞股份有限公司 | Determine the method and system of text voice unit |
CN105956588A (en) * | 2016-04-21 | 2016-09-21 | 深圳前海勇艺达机器人有限公司 | Method of intelligent scanning and text reading and robot device |
JP2017107228A (en) * | 2017-02-20 | 2017-06-15 | 株式会社テクノスピーチ | Singing voice synthesis device and singing voice synthesis method |
CN107452408A (en) * | 2017-07-27 | 2017-12-08 | 上海与德科技有限公司 | A kind of audio frequency playing method and device |
CN107705783A (en) * | 2017-11-27 | 2018-02-16 | 北京搜狗科技发展有限公司 | A kind of phoneme synthesizing method and device |
CN107945786A (en) * | 2017-11-27 | 2018-04-20 | 北京百度网讯科技有限公司 | Phoneme synthesizing method and device |
Non-Patent Citations (2)
Title |
---|
Ling Zhenhua: "Research on Speech Synthesis Technology Based on Statistical Acoustic Modeling", Information Science and Technology Series * |
Cui Xuan: "Research on Speaker Recognition Based on Mixed Speech Features", Information Science and Technology Series * |
Cited By (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109785823A (en) * | 2019-01-22 | 2019-05-21 | 中财颐和科技发展(北京)有限公司 | Phoneme synthesizing method and system |
CN110070852A (en) * | 2019-04-26 | 2019-07-30 | 平安科技(深圳)有限公司 | Synthesize method, apparatus, equipment and the storage medium of Chinese speech |
CN110070852B (en) * | 2019-04-26 | 2023-06-16 | 平安科技(深圳)有限公司 | Method, device, equipment and storage medium for synthesizing Chinese voice |
WO2020215551A1 (en) * | 2019-04-26 | 2020-10-29 | 平安科技(深圳)有限公司 | Chinese speech synthesizing method, apparatus and device, storage medium |
CN110335608A (en) * | 2019-06-17 | 2019-10-15 | 平安科技(深圳)有限公司 | Voice print verification method, apparatus, equipment and storage medium |
CN110335608B (en) * | 2019-06-17 | 2023-11-28 | 平安科技(深圳)有限公司 | Voiceprint verification method, voiceprint verification device, voiceprint verification equipment and storage medium |
CN112287112A (en) * | 2019-07-25 | 2021-01-29 | 北京中关村科金技术有限公司 | Method, device and storage medium for constructing special pronunciation dictionary |
US11417314B2 (en) | 2019-09-19 | 2022-08-16 | Baidu Online Network Technology (Beijing) Co., Ltd. | Speech synthesis method, speech synthesis device, and electronic apparatus |
CN110619867A (en) * | 2019-09-27 | 2019-12-27 | 百度在线网络技术(北京)有限公司 | Training method and device of speech synthesis model, electronic equipment and storage medium |
US11488577B2 (en) | 2019-09-27 | 2022-11-01 | Baidu Online Network Technology (Beijing) Co., Ltd. | Training method and apparatus for a speech synthesis model, and storage medium |
CN110619867B (en) * | 2019-09-27 | 2020-11-03 | 百度在线网络技术(北京)有限公司 | Training method and device of speech synthesis model, electronic equipment and storage medium |
CN110880327A (en) * | 2019-10-29 | 2020-03-13 | 平安科技(深圳)有限公司 | Audio signal processing method and device |
WO2021127821A1 (en) * | 2019-12-23 | 2021-07-01 | 深圳市优必选科技股份有限公司 | Speech synthesis model training method, apparatus, computer device, and storage medium |
CN110956948A (en) * | 2020-01-03 | 2020-04-03 | 北京海天瑞声科技股份有限公司 | End-to-end speech synthesis method, device and storage medium |
CN113314096A (en) * | 2020-02-25 | 2021-08-27 | 阿里巴巴集团控股有限公司 | Speech synthesis method, apparatus, device and storage medium |
CN111785248B (en) * | 2020-03-12 | 2023-06-23 | 北京汇钧科技有限公司 | Text information processing method and device |
CN111785248A (en) * | 2020-03-12 | 2020-10-16 | 北京京东尚科信息技术有限公司 | Text information processing method and device |
CN111653265A (en) * | 2020-04-26 | 2020-09-11 | 北京大米科技有限公司 | Speech synthesis method, speech synthesis device, storage medium and electronic equipment |
CN111653265B (en) * | 2020-04-26 | 2023-08-18 | 北京大米科技有限公司 | Speech synthesis method, device, storage medium and electronic equipment |
CN111583900B (en) * | 2020-04-27 | 2022-01-07 | 北京字节跳动网络技术有限公司 | Song synthesis method and device, readable medium and electronic equipment |
CN111583900A (en) * | 2020-04-27 | 2020-08-25 | 北京字节跳动网络技术有限公司 | Song synthesis method and device, readable medium and electronic equipment |
CN112151005B (en) * | 2020-09-28 | 2022-08-19 | 四川长虹电器股份有限公司 | Chinese and English mixed speech synthesis method and device |
CN112151005A (en) * | 2020-09-28 | 2020-12-29 | 四川长虹电器股份有限公司 | Chinese and English mixed speech synthesis method and device |
WO2022141710A1 (en) * | 2020-12-28 | 2022-07-07 | 科大讯飞股份有限公司 | Speech synthesis method, apparatus and device, and storage medium |
CN113053355A (en) * | 2021-03-17 | 2021-06-29 | 平安科技(深圳)有限公司 | Fole human voice synthesis method, device, equipment and storage medium |
CN113421547A (en) * | 2021-06-03 | 2021-09-21 | 华为技术有限公司 | Voice processing method and related equipment |
CN114974208A (en) * | 2022-06-20 | 2022-08-30 | 青岛大学 | Chinese speech synthesis method and device, electronic equipment and storage medium |
CN114974208B (en) * | 2022-06-20 | 2024-05-31 | 青岛大学 | Chinese speech synthesis method, device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109036377A (en) | A kind of phoneme synthesizing method and device | |
JP7500020B2 (en) | Multilingual text-to-speech synthesis method | |
Gold et al. | Speech and audio signal processing: processing and perception of speech and music | |
Kawai et al. | XIMERA: A new TTS from ATR based on corpus-based technologies | |
CN108899009B (en) | Chinese speech synthesis system based on phoneme | |
CN109313891B (en) | System and method for speech synthesis | |
US5930755A (en) | Utilization of a recorded sound sample as a voice source in a speech synthesizer | |
Govind et al. | Expressive speech synthesis: a review | |
US20060259303A1 (en) | Systems and methods for pitch smoothing for text-to-speech synthesis | |
US8775185B2 (en) | Speech samples library for text-to-speech and methods and apparatus for generating and using same | |
Qian et al. | A cross-language state sharing and mapping approach to bilingual (Mandarin–English) TTS | |
US20130262120A1 (en) | Speech synthesis device and speech synthesis method | |
WO2023279976A1 (en) | Speech synthesis method, apparatus, device, and storage medium | |
Kardava et al. | Solving the problem of the accents for speech recognition systems | |
CN114255738A (en) | Speech synthesis method, apparatus, medium, and electronic device | |
Ipsic et al. | Croatian HMM-based speech synthesis | |
RU61924U1 (en) | STATISTICAL SPEECH MODEL | |
Nthite et al. | End-to-End Text-To-Speech synthesis for under resourced South African languages | |
Takaki et al. | Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2012 | |
Maia et al. | An HMM-based Brazilian Portuguese speech synthesizer and its characteristics | |
Hinterleitner et al. | Speech synthesis | |
Georgila | 19 Speech Synthesis: State of the Art and Challenges for the Future | |
Karjalainen | Review of speech synthesis technology | |
KR100608643B1 (en) | Pitch modelling apparatus and method for voice synthesizing system | |
Näslund | Simulating hypernasality with phonological features in Swedish TTS |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20181218 |