CN109036377A - Speech synthesis method and device - Google Patents

Speech synthesis method and device

Info

Publication number
CN109036377A
CN109036377A
Authority
CN
China
Prior art keywords
speech
phoneme
feature vector
voice
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810834892.3A
Other languages
Chinese (zh)
Inventor
何树民
徐文韬
陈玉玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Unionpay Co Ltd
Original Assignee
China Unionpay Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Unionpay Co Ltd filed Critical China Unionpay Co Ltd
Priority to CN201810834892.3A priority Critical patent/CN109036377A/en
Publication of CN109036377A publication Critical patent/CN109036377A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/10: Prosody rules derived from text; Stress or intonation
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique
    • G10L 25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides a speech synthesis method and device, relating to the field of computer technology. The method comprises: determining a phoneme sequence corresponding to the text information to be uttered, where the phoneme sequence comprises a plurality of phoneme information items whose ordering is consistent with the ordering of the characters in the text information, and each phoneme information item comprises the initial, final, and tone of its corresponding character; and inputting the phoneme sequence into a speech utterance model to determine the speech feature vector corresponding to the text information, where the speech utterance model is obtained by neural network training on utterance samples, and the speech feature vector is used for playback through a playing device. Because the relationship among the initials, finals, and tones of Chinese pronunciation is taken into account, the simulated sound is more authentic, and the method is applicable to the various dialects and other languages composed of phonemes, giving it very high extensibility.

Description

Speech synthesis method and device
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for speech synthesis.
Background
Speech synthesis is a technique for generating artificial speech by mechanical or electronic means. TTS (text-to-speech) technology is a branch of speech synthesis; it converts text information generated by a computer or input from the outside into intelligible, fluent spoken Chinese and outputs it.
Current speech synthesis techniques in the prior art include the following. The first is rule-based synthesis, which generates the target speech from phonetic rules. A rule-based synthesis system stores the acoustic parameters of small speech units (e.g., phonemes, diphones, demi-syllables, or syllables), together with rules for composing syllables from phonemes and words or sentences from syllables. The second is waveform-concatenation synthesis, which uses sentences, phrases, words, or syllables as synthesis units. These units are recorded, digitally coded, and compressed to form a synthetic speech library; at playback time, the waveform data of the corresponding units are retrieved from the corpus according to the information to be output, concatenated or edited together, and decoded to restore the speech. The third is synthesis based on parameter analysis, which mostly uses syllables, demi-syllables, or phones as synthesis units.
However, the first method requires very complicated rules, with different rules set for different environments and contexts, and the synthesized speech has low naturalness and cannot be widely applied. The second and third methods can synthesize high-quality speech only with a sufficient amount of high-quality speaker recordings, their speech library files are too large, and they cannot handle polyphonic characters at all; nor can they synthesize from speech collected in public settings.
In summary, the prior art cannot provide a speech synthesis method with simple rules and high speech naturalness.
Disclosure of Invention
The invention provides a speech synthesis method and a speech synthesis device to solve the problem that the prior art cannot provide a speech synthesis method with simple rules and high naturalness.
An embodiment of the invention provides a speech synthesis method, which comprises the following steps: determining a phoneme sequence corresponding to the text information to be uttered; the phoneme sequence comprises a plurality of phoneme information items, and the ordering of the phoneme information items is consistent with the ordering of the characters in the text information; each phoneme information item comprises the initial, final, and tone of its corresponding character;
inputting the phoneme sequence into a speech utterance model, and determining the speech feature vector corresponding to the text information, wherein the speech utterance model is obtained by neural network training on utterance samples; the speech feature vector is used for playback through a playing device.
In the embodiment of the invention, the text information to be uttered is converted into a sequence that follows the arrangement of the initials, finals, and tones of its characters. Because the relationship among the initials, finals, and tones of Chinese pronunciation is taken into account, the simulated sound is more authentic, and the method is applicable to the various dialects and other languages composed of phonemes, giving it very high extensibility.
Further, inputting the phoneme sequence into the speech utterance model and determining the speech feature vector corresponding to the text information includes:
inputting speech parameters and the phoneme sequence into the speech utterance model, wherein the speech parameters indicate the identity of a speaker; the speech utterance model is obtained by neural network training on the utterance samples of all speakers;
and determining, through the speech utterance model, the speech feature vector of the speaker for the text information.
In the embodiment of the invention, when speech parameters are input into the speech utterance model, the obtained speech feature vector carries the speech content in the voice of the speaker corresponding to those parameters, so that the utterance is produced in that speaker's voice. Any speaker the speech utterance model has been trained on can thus be used for playback.
Further, before the speech parameters and the phoneme sequence are input into the speech utterance model, the method further includes:
acquiring random noise, wherein the random noise follows a normal distribution;
and the inputting of the speech parameters and the phoneme sequence into the speech utterance model then includes:
inputting the speech parameters, the random noise, and the phoneme sequence into the speech utterance model.
In the embodiment of the invention, to prevent the speech feature vector output by the speech utterance model from overfitting, random noise following a normal distribution is added, making the output speech feature vector more accurate.
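As a minimal illustration of this step (our own sketch; the patent gives no code, and all shapes and identifiers here are assumptions), the noise is simply drawn from a standard normal distribution and passed alongside the other model inputs:

```python
import numpy as np

# Standard normal noise concatenated with the model's other inputs (sketch).
rng = np.random.default_rng(0)
noise = rng.standard_normal(size=(1, 16))        # noise dimension is assumed
speaker_id = np.array([[3]])                     # hypothetical speaker identifier
phoneme_ids = np.array([[12, 40, 7, 35]])        # hypothetical phoneme sequence
model_inputs = (phoneme_ids, speaker_id, noise)  # fed jointly to the network
```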
Further, the speech feature vector comprises a 60-dimensional Mel-generalized cepstral (MGC) vector, a 1-dimensional band aperiodicity (BAP) feature, a 1-dimensional log fundamental frequency (lf0) feature, and a 1-dimensional voiced/unvoiced (v/uv) feature.
In the embodiment of the invention, the selected 63-dimensional vector better models Chinese pronunciation and individual voices, so the speech feature vector better matches Chinese pronunciation rules and the generated speech better imitates the target voice.
Further, the speech utterance model is a neural network model with a working memory mechanism.
In the embodiment of the invention, because the phonemes of each character in Chinese pronunciation have no dependence on the preceding character, using a neural network model with a working memory mechanism reduces the amount of computation of the neural network and simplifies the model.
An embodiment of the present invention further provides a speech synthesis apparatus, including:
a phoneme sequence determining unit, configured to determine the phoneme sequence corresponding to the text information to be uttered; the phoneme sequence comprises a plurality of phoneme information items, and the ordering of the phoneme information items is consistent with the ordering of the characters in the text information; each phoneme information item comprises the initial, final, and tone of its corresponding character;
and a speech feature vector determining unit, configured to input the phoneme sequence into a speech utterance model and determine the speech feature vector corresponding to the text information, wherein the speech utterance model is obtained by neural network training on utterance samples; the speech feature vector is used for playback through a playing device.
In the embodiment of the invention, the text information to be uttered is converted into a sequence that follows the arrangement of the initials, finals, and tones of its characters. Because the relationship among the initials, finals, and tones of Chinese pronunciation is taken into account, the simulated sound is more authentic, and the apparatus is applicable to the various dialects and other languages composed of phonemes, giving it very high extensibility.
Further, the speech feature vector determining unit is specifically configured to:
input speech parameters and the phoneme sequence into the speech utterance model, wherein the speech parameters indicate the identity of a speaker; the speech utterance model is obtained by neural network training on the utterance samples of all speakers;
and determine, through the speech utterance model, the speech feature vector of the speaker for the text information.
Further, the phoneme sequence determining unit is further configured to:
acquire random noise, wherein the random noise follows a normal distribution;
and the speech feature vector determining unit is specifically configured to:
input the speech parameters, the random noise, and the phoneme sequence into the speech utterance model.
Further, the speech feature vector comprises a 60-dimensional Mel-generalized cepstral (MGC) vector, a 1-dimensional band aperiodicity (BAP) feature, a 1-dimensional log fundamental frequency (lf0) feature, and a 1-dimensional voiced/unvoiced (v/uv) feature.
Further, the speech utterance model is a neural network model with a working memory mechanism.
An embodiment of the present invention further provides an electronic device, including:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any of the above embodiments.
Embodiments of the present invention also provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of the above embodiments.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.
FIG. 1 is a schematic flow chart of a speech synthesis method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of training a speech utterance model according to an embodiment of the present invention;
FIG. 3 is a diagram of a human speech production model according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating a neural network training process according to an embodiment of the present invention;
FIG. 5 is a schematic flow chart of a neural network training process according to an embodiment of the present invention;
FIG. 6 is a schematic flow chart of speech synthesis according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The present invention provides a speech synthesis method, as shown in FIG. 1, including:
step 101, determining a phoneme sequence corresponding to the text information to be uttered; the phoneme sequence comprises a plurality of phoneme information items, and the ordering of the phoneme information items is consistent with the ordering of the characters in the text information; each phoneme information item comprises the initial, final, and tone of its corresponding character;
step 102, inputting the phoneme sequence into a speech utterance model, and determining the speech feature vector corresponding to the text information, wherein the speech utterance model is obtained by neural network training on utterance samples; the speech feature vector is used for playback through a playing device.
In step 101, the text information to be uttered is the text that requires speech synthesis. In the embodiment of the present invention, the initial, final, and tone of each character in the text information correspond one-to-one with the entries of the phoneme sequence; the text information is converted into a phoneme sequence according to this correspondence, and the ordering within the phoneme sequence is consistent with the arrangement of the characters.
For example, in the embodiment of the present invention, each Chinese character may be split into an initial, a final, and a tone; the initial corresponds to one identifier in the phoneme comparison table, the final together with the tone corresponds to another, and these identifiers are arranged in order to form the phoneme sequence.
For example, take "你好" ("hello") as the text information to be uttered. The pinyin of "你" is "ni3", where "n" is the initial, "i" is the final, and 3 denotes the tone; to better match Chinese pronunciation, the final and the tone are combined together as "i3". Optionally, in the embodiment of the present invention, a phoneme comparison table exists, for example as shown in Table 1:
table 1: phoneme comparison table
Through the phoneme comparison table, the initials, finals, and tones of the text information to be uttered can be converted into a phoneme sequence, whose arrangement follows the initial and the final-plus-tone of each Chinese character, in the order of the characters, as illustrated by the sketch below.
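As an illustration only (the contents of Table 1 are not reproduced above, so the table entries and helper name below are assumed stand-ins, not the patent's actual mapping), the conversion of step 101 might look like:

```python
# A minimal sketch of step 101, assuming a hypothetical phoneme comparison
# table; the real Table 1 of the patent is not reproduced here.
PHONEME_TABLE = {"n": 1, "i3": 2, "h": 3, "ao3": 4}  # assumed identifiers

def text_to_phoneme_sequence(syllables):
    """Map (initial, final+tone) pairs, in character order, to identifiers."""
    sequence = []
    for initial, final_tone in syllables:
        if initial:  # zero-initial syllables contribute no initial identifier
            sequence.append(PHONEME_TABLE[initial])
        sequence.append(PHONEME_TABLE[final_tone])
    return sequence

# "你好" -> pinyin "ni3 hao3" -> (initial, final+tone) pairs per character
print(text_to_phoneme_sequence([("n", "i3"), ("h", "ao3")]))  # [1, 2, 3, 4]
```

The ordering of the output follows the character order, matching the requirement that the phoneme sequence be consistent with the ordering of the characters.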
In step 102, after the phoneme sequence is determined, it is input into the speech utterance model and the output result is determined. In the embodiment of the invention, the speech utterance model is obtained by training a neural network on utterance samples.
Optionally, in the embodiment of the present invention, the speech utterance model is trained on utterance samples, which may be any available audio, including audio from the Internet, recorded speeches, video soundtracks, and the like.
Optionally, in the embodiment of the present invention, the playing device may be a WORLD vocoder, which pronounces according to the determined speech feature vector; the content of the pronunciation is the text information to be uttered.
Optionally, in step 102, inputting the phoneme sequence into the speech utterance model and determining the speech feature vector corresponding to the text information includes:
inputting speech parameters and the phoneme sequence into the speech utterance model, wherein the speech parameters indicate the identity of a speaker; the speech utterance model is obtained by neural network training on the utterance samples of all speakers;
and determining, through the speech utterance model, the speech feature vector of the speaker for the text information.
That is to say, in addition to the phoneme sequence, the acquired speech parameters are also input into the speech utterance model. In other words, in the embodiment of the present invention the model is trained not only on the utterance samples themselves but also on the identity information of the speaker of each sample; the resulting model can therefore not only pronounce accurately but also speak in the voice of any speaker it has been trained on.
Optionally, in the embodiment of the present invention, the input speech parameter is the identification information of a speaker whose voice is present in the training samples used to train the speech utterance model.
Optionally, in the embodiment of the present invention, the speech utterance model may be trained by the method shown in FIG. 2.
Optionally, in the embodiment of the present invention, the training samples may be any audio, including the audio tracks of videos, recorded speeches, audio programs, and similar audio content.
After the audio samples are obtained, the identity information of each speaker also needs to be obtained for model training, so that the trained speech utterance model can learn the pronunciation of each speaker.
Optionally, in the embodiment of the present invention, the training samples are divided into training data and test data, and whether the trained model achieves the expected target is judged according to the audio features of the samples.
In general, speech signal processing does not operate directly on the time-domain waveform of the sound; the audio must first be processed. The training sample is preprocessed, and the acoustic features of its audio part are extracted as a 63 × n matrix, where n is proportional to the length of the audio. The 63 dimensions mainly comprise 60-dimensional Mel-generalized cepstral (MGC) parameters: in this embodiment, the MGC features are obtained by reducing the dimensionality of Mel-frequency cepstral coefficient (MFCC) features, whose dimensionality is too high for direct training;
a 1-dimensional band aperiodicity (BAP) feature: the aperiodic excitation signal; a 1-dimensional log fundamental frequency (lf0): the fundamental frequency, a periodic pulse train characterizing an individual's voicing; and a 1-dimensional voiced/unvoiced flag (v/uv): representing unvoiced and voiced sounds.
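One plausible way to extract such 63-dimensional frame features is sketched below, assuming the pyworld and pysptk libraries; note that the patent derives MGC by reducing the dimensionality of MFCC features, whereas this sketch, as a simplification, derives MGC from the WORLD spectral envelope:

```python
# Sketch: extract a 63-dim feature per frame (60 MGC + 1 BAP + 1 lf0 + 1 v/uv).
import numpy as np
import pyworld
import pysptk
import soundfile as sf

x, fs = sf.read("sample.wav")                       # hypothetical training audio
f0, sp, ap = pyworld.wav2world(x.astype(np.float64), fs)

mgc = pysptk.sp2mc(sp, order=59, alpha=0.58)        # 60-dim MGC per frame
bap = pyworld.code_aperiodicity(ap, fs)             # band aperiodicity (1 band at 16 kHz)
vuv = (f0 > 0).astype(np.float64)[:, None]          # 1-dim voiced/unvoiced flag
lf0 = np.where(f0 > 0, np.log(np.maximum(f0, 1e-10)), 0.0)[:, None]  # 1-dim log F0

features = np.hstack([mgc, bap[:, :1], lf0, vuv])   # n x 63; transpose for 63 x n
```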
In the embodiment of the present invention, this 63-dimensional vector is selected as the acoustic feature of the training samples because it corresponds closely to the human speech production model.
The human speech production model is shown in FIG. 3. The source excitation portion of FIG. 3 corresponds to the combined action of the airflow from the lungs and the vocal cords, and the vocal tract resonance portion corresponds to the articulatory movement of the vocal tract. Source excitation falls into two classes, voiced and unvoiced: voiced sounds are generated when the airflow strikes the vocal cords and sets them vibrating, producing quasi-periodic pulses at the glottis, so voiced sounds can be represented by a periodic pulse excitation; for unvoiced sounds the vocal cords do not vibrate and the airflow enters the vocal tract directly, so the excitation can be approximated as white noise. A person's vocal tract resonance (articulation) corresponds to a linear time-invariant system whose response to (convolution with) the input signal is the speech the person utters. In the features above, f0 corresponds to the periodic pulse excitation and BAP to the aperiodic excitation, while the vocal tract response h(n) can be derived from the MGC.
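The source-filter relation described here, s(n) = e(n) * h(n), can be illustrated with a toy simulation (a sketch of the general model only, not the patent's implementation; the pulse rate, resonance frequency, and filter are arbitrary assumptions):

```python
# Toy source-filter simulation: excitation convolved with a vocal tract filter.
import numpy as np
from scipy.signal import lfilter

fs = 16000
n = fs // 10                                  # 100 ms of samples
voiced_excitation = np.zeros(n)
voiced_excitation[::fs // 100] = 1.0          # 100 Hz periodic pulse train (voiced)
unvoiced_excitation = np.random.randn(n)      # white noise (unvoiced)

# Crude all-pole "vocal tract" with a single resonance near 500 Hz.
r, theta = 0.97, 2 * np.pi * 500 / fs
a = [1.0, -2 * r * np.cos(theta), r * r]      # denominator coefficients of h(n)
voiced_speech = lfilter([1.0], a, voiced_excitation)
unvoiced_speech = lfilter([1.0], a, unvoiced_excitation)
```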
In the embodiment of the invention, the training samples are input into the initial training model to obtain a preliminary training result, which is compared against the 63-dimensional audio features of the training samples; the model is adjusted accordingly until training is finished, yielding the speech utterance model.
Optionally, in the embodiment of the present invention, the speech utterance model may be a neural network model with a working memory mechanism. Since in Chinese the pronunciation of each character is related only to its own phonemes and not to the preceding and following characters, the embodiment adopts a working memory mechanism to reduce training time and difficulty.
As shown in FIG. 4, when the training sample is "我是中国人" ("I am Chinese"), the determined sequence of initials and finals with tones is "w, o3, sh, i4, zh, ong1, g, uo2, r, en2". The Speaker ID is the identification information of the speaker; it may be an identification code stored in the speech utterance model, such as a serial number or an identity number, serving as the unique identifier of the speaker.
This content is input into the neural network model with the working memory mechanism, and the working memory units are updated, determining which units are to be retained and which are to be discarded.
For example, in the embodiment of the present invention, the system outputs n iterations of 63-dimensional vectors, each of which is called an output unit. The whole working memory model consists of k working memory units, and the output unit at time t is determined by the current memory state Bt. The update procedure for B is shown in FIG. 5 and sketched in code after the steps below:
step 501, acquiring the text content of the current iteration based on an attention mechanism;
step 502, obtaining the Speaker ID of the speaker, the output unit of the previous iteration, and the current working memory Bt;
step 503, computing the updated working memory unit u (the bold frame in the figure) from the above four quantities;
and step 504, removing B[k] from B, shifting the remaining units one position to the right, and setting B[1] = u to obtain the new B.
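A compact sketch of this update loop follows; the patent specifies only the four inputs of step 503, so the combiner that produces u is a hypothetical stand-in:

```python
# Sketch of the FIG. 5 working-memory update (combiner network is assumed).
import numpy as np

def update_working_memory(B, attn_text, speaker_vec, last_output, combine):
    """B: list of k working-memory vectors; returns the shifted, updated B."""
    Bt = np.concatenate(B)                                # current memory state
    u = combine(attn_text, speaker_vec, last_output, Bt)  # step 503
    return [u] + B[:-1]                                   # drop B[k], shift, B[1] = u

# Hypothetical combiner: a fixed random projection of the concatenated inputs.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8 * 3 + 8 * 4))               # dims assumed: d=8, k=4
combine = lambda *parts: np.tanh(W @ np.concatenate(parts))

B = [np.zeros(8) for _ in range(4)]
B = update_working_memory(B, rng.standard_normal(8), rng.standard_normal(8),
                          rng.standard_normal(8), combine)
```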
Optionally, in the embodiment of the present invention, after the speech utterance model is determined, a speech feature vector can be obtained from the text to be uttered, the identification information of the desired speaker, and the speech utterance model; the speech feature vector is then input to a playing device, such as a WORLD vocoder, so that the text to be uttered is played in the voice of that speaker.
Optionally, in the embodiment of the present invention, to prevent the speech utterance model from overfitting, random noise following a standard normal distribution may be added when determining the speech feature vector; this increases the diversity of the model's samples so that the model does not overfit and the obtained speech feature vector is more accurate.
For example, as shown in FIG. 6, speech synthesis comprises the following steps (a combined code sketch follows the list):
step 601, converting the string of Chinese text to be generated into the corresponding phoneme sequence through the correspondence among initials, finals, tones, and the phoneme table;
step 602, generating a segment of random noise that follows a standard normal distribution;
step 603, inputting the random noise, the ID of the speaker to be imitated, and the phoneme sequence into the trained neural network module, which outputs a stream of audio acoustic features close to those of the specified speaker;
and step 604, inputting the audio acoustic features into the WORLD vocoder, which outputs a segment of audio whose content is the input string of Chinese text, pronounced in the voice of the specified speaker.
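Putting the steps together, one plausible end-to-end sketch (reusing the earlier hypothetical helpers and an assumed trained `model`; the mapping from the 63-dim features back to WORLD inputs is a simplification) is:

```python
# Sketch of the FIG. 6 pipeline: text -> phonemes -> features -> WORLD audio.
import numpy as np
import pyworld
import pysptk
import soundfile as sf

def synthesize(text_syllables, speaker_id, model, fs=16000):
    phonemes = text_to_phoneme_sequence(text_syllables)   # step 601 (earlier sketch)
    noise = np.random.default_rng().standard_normal(16)   # step 602
    feats = np.asarray(model(phonemes, speaker_id, noise), dtype=np.float64)  # step 603
    mgc, bap = feats[:, :60], feats[:, 60:61]
    lf0, vuv = feats[:, 61], feats[:, 62]
    sp = pysptk.mc2sp(mgc, alpha=0.58, fftlen=1024)
    ap = pyworld.decode_aperiodicity(np.ascontiguousarray(bap), fs, 1024)
    f0 = np.where(vuv > 0.5, np.exp(lf0), 0.0)            # voiced frames only
    audio = pyworld.synthesize(f0, sp, ap, fs)            # step 604: WORLD vocoder
    sf.write("out.wav", audio, fs)
```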
Based on the same concept, an embodiment of the present invention further provides a speech synthesis apparatus, as shown in FIG. 7, including:
a phoneme sequence determining unit 701, configured to determine the phoneme sequence corresponding to the text information to be uttered; the phoneme sequence comprises a plurality of phoneme information items, and the ordering of the phoneme information items is consistent with the ordering of the characters in the text information; each phoneme information item comprises the initial, final, and tone of its corresponding character;
and a speech feature vector determining unit 702, configured to input the phoneme sequence into a speech utterance model and determine the speech feature vector corresponding to the text information, wherein the speech utterance model is obtained by neural network training on utterance samples; the speech feature vector is used for playback through a playing device.
Further, the speech feature vector determining unit 702 is specifically configured to:
input speech parameters and the phoneme sequence into the speech utterance model, wherein the speech parameters indicate the identity of a speaker; the speech utterance model is obtained by neural network training on the utterance samples of all speakers;
and determine, through the speech utterance model, the speech feature vector of the speaker for the text information.
Further, the phoneme sequence determining unit 701 is further configured to:
acquire random noise, wherein the random noise follows a normal distribution;
and the speech feature vector determining unit 702 is specifically configured to:
input the speech parameters, the random noise, and the phoneme sequence into the speech utterance model.
Further, the speech feature vector comprises a 60-dimensional Mel-generalized cepstral (MGC) vector, a 1-dimensional band aperiodicity (BAP) feature, a 1-dimensional log fundamental frequency (lf0) feature, and a 1-dimensional voiced/unvoiced (v/uv) feature.
Further, the speech utterance model is a neural network model with a working memory mechanism.
Based on the same principle, the present invention also provides an electronic device, as shown in FIG. 8, including:
a processor 801, a memory 802, a transceiver 803, and a bus interface 804, where the processor 801, the memory 802, and the transceiver 803 are connected through the bus interface 804;
the processor 801 is configured to read the program in the memory 802 and execute the following method:
determining a phoneme sequence corresponding to the text information to be uttered; the phoneme sequence comprises a plurality of phoneme information items, and the ordering of the phoneme information items is consistent with the ordering of the characters in the text information; each phoneme information item comprises the initial, final, and tone of its corresponding character;
inputting the phoneme sequence into a speech utterance model, and determining the speech feature vector corresponding to the text information, wherein the speech utterance model is obtained by neural network training on utterance samples; the speech feature vector is used for playback through a playing device.
Further, the processor 801 is specifically configured to:
input speech parameters and the phoneme sequence into the speech utterance model, wherein the speech parameters indicate the identity of a speaker; the speech utterance model is obtained by neural network training on the utterance samples of all speakers;
and determine, through the speech utterance model, the speech feature vector of the speaker for the text information.
Further, the processor 801 acquires random noise through the transceiver 803, wherein the random noise follows a normal distribution;
the processor 801 is specifically configured to:
input the speech parameters, the random noise, and the phoneme sequence into the speech utterance model.
Further, the speech feature vector comprises a 60-dimensional Mel-generalized cepstral (MGC) vector, a 1-dimensional band aperiodicity (BAP) feature, a 1-dimensional log fundamental frequency (lf0) feature, and a 1-dimensional voiced/unvoiced (v/uv) feature.
Further, the speech utterance model is a neural network model with a working memory mechanism.
The present application also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform any one of the above speech synthesis methods.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (12)

1. A method of speech synthesis, the method comprising:
determining a phoneme sequence corresponding to the text information to be uttered; the phoneme sequence comprises a plurality of phoneme information items, and the ordering of the phoneme information items is consistent with the ordering of the characters in the text information; each phoneme information item comprises the initial, final, and tone of its corresponding character;
inputting the phoneme sequence into a speech utterance model, and determining a speech feature vector corresponding to the text information, wherein the speech utterance model is obtained by neural network training on utterance samples; the speech feature vector is used for playback through a playing device.
2. The method of claim 1, wherein said inputting the phoneme sequence into a speech utterance model and determining a speech feature vector corresponding to the text information comprises:
inputting speech parameters and the phoneme sequence into the speech utterance model, wherein the speech parameters indicate the identity of a speaker; the speech utterance model is obtained by neural network training on the utterance samples of all speakers;
and determining, through the speech utterance model, the speech feature vector of the speaker for the text information.
3. The method of claim 2, wherein before said inputting speech parameters and the phoneme sequence into the speech utterance model, the method further comprises:
acquiring random noise, wherein the random noise follows a normal distribution;
and said inputting the speech parameters and the phoneme sequence into the speech utterance model comprises:
inputting the speech parameters, the random noise, and the phoneme sequence into the speech utterance model.
4. The method of any one of claims 1 to 3, wherein the speech feature vector comprises a 60-dimensional Mel-generalized cepstral (MGC) vector, a 1-dimensional band aperiodicity (BAP) feature, a 1-dimensional log fundamental frequency (lf0) feature, and a 1-dimensional voiced/unvoiced (v/uv) feature.
5. The method of claim 4, wherein the speech utterance model is a neural network model with a working memory mechanism.
6. A speech synthesis apparatus, comprising:
a phoneme sequence determining unit, configured to determine a phoneme sequence corresponding to the text information to be uttered; the phoneme sequence comprises a plurality of phoneme information items, and the ordering of the phoneme information items is consistent with the ordering of the characters in the text information; each phoneme information item comprises the initial, final, and tone of its corresponding character;
and a speech feature vector determining unit, configured to input the phoneme sequence into a speech utterance model and determine a speech feature vector corresponding to the text information, wherein the speech utterance model is obtained by neural network training on utterance samples; the speech feature vector is used for playback through a playing device.
7. The apparatus of claim 6, wherein the speech feature vector determining unit is specifically configured to:
input speech parameters and the phoneme sequence into the speech utterance model, wherein the speech parameters indicate the identity of a speaker; the speech utterance model is obtained by neural network training on the utterance samples of all speakers;
and determine, through the speech utterance model, the speech feature vector of the speaker for the text information.
8. The apparatus of claim 7, wherein the phoneme sequence determining unit is further configured to:
acquire random noise, wherein the random noise follows a normal distribution;
and the speech feature vector determining unit is specifically configured to:
input the speech parameters, the random noise, and the phoneme sequence into the speech utterance model.
9. The apparatus of any one of claims 6 to 8, wherein the speech feature vector comprises a 60-dimensional Mel-generalized cepstral (MGC) vector, a 1-dimensional band aperiodicity (BAP) feature, a 1-dimensional log fundamental frequency (lf0) feature, and a 1-dimensional voiced/unvoiced (v/uv) feature.
10. The apparatus of claim 9, wherein the speech utterance model is a neural network model with a working memory mechanism.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.
12. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 5.
CN201810834892.3A 2018-07-26 2018-07-26 Speech synthesis method and device Pending CN109036377A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810834892.3A CN109036377A (en) 2018-07-26 2018-07-26 Speech synthesis method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810834892.3A CN109036377A (en) 2018-07-26 2018-07-26 Speech synthesis method and device

Publications (1)

Publication Number Publication Date
CN109036377A (en) 2018-12-18

Family

ID=64645616

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810834892.3A Pending CN109036377A (en) 2018-07-26 2018-07-26 Speech synthesis method and device

Country Status (1)

Country Link
CN (1) CN109036377A (en)



Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102270449A (en) * 2011-08-10 2011-12-07 歌尔声学股份有限公司 Method and system for synthesising parameter speech
CN103236259A (en) * 2013-03-22 2013-08-07 乐金电子研发中心(上海)有限公司 Voice recognition processing and feedback system, voice response method
CN106205601A (en) * 2015-05-06 2016-12-07 科大讯飞股份有限公司 Determine the method and system of text voice unit
CN105956588A (en) * 2016-04-21 2016-09-21 深圳前海勇艺达机器人有限公司 Method of intelligent scanning and text reading and robot device
JP2017107228A (en) * 2017-02-20 2017-06-15 株式会社テクノスピーチ Singing voice synthesis device and singing voice synthesis method
CN107452408A (en) * 2017-07-27 2017-12-08 上海与德科技有限公司 A kind of audio frequency playing method and device
CN107705783A (en) * 2017-11-27 2018-02-16 北京搜狗科技发展有限公司 A kind of phoneme synthesizing method and device
CN107945786A (en) * 2017-11-27 2018-04-20 北京百度网讯科技有限公司 Phoneme synthesizing method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
凌震华: "Research on Speech Synthesis Technology Based on Statistical Acoustic Modeling" (《基于统计声学建模的语音合成技术研究》), 《信息科技辑》 *
崔宣: "Research on Speaker Recognition Based on Mixed Speech Features" (《基于语音混合特征说话人识别的研究》), 《信息科技辑》 *

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109785823A (en) * 2019-01-22 2019-05-21 中财颐和科技发展(北京)有限公司 Phoneme synthesizing method and system
CN110070852A (en) * 2019-04-26 2019-07-30 平安科技(深圳)有限公司 Synthesize method, apparatus, equipment and the storage medium of Chinese speech
CN110070852B (en) * 2019-04-26 2023-06-16 平安科技(深圳)有限公司 Method, device, equipment and storage medium for synthesizing Chinese voice
WO2020215551A1 (en) * 2019-04-26 2020-10-29 平安科技(深圳)有限公司 Chinese speech synthesizing method, apparatus and device, storage medium
CN110335608A (en) * 2019-06-17 2019-10-15 平安科技(深圳)有限公司 Voice print verification method, apparatus, equipment and storage medium
CN110335608B (en) * 2019-06-17 2023-11-28 平安科技(深圳)有限公司 Voiceprint verification method, voiceprint verification device, voiceprint verification equipment and storage medium
CN112287112A (en) * 2019-07-25 2021-01-29 北京中关村科金技术有限公司 Method, device and storage medium for constructing special pronunciation dictionary
US11417314B2 (en) 2019-09-19 2022-08-16 Baidu Online Network Technology (Beijing) Co., Ltd. Speech synthesis method, speech synthesis device, and electronic apparatus
CN110619867A (en) * 2019-09-27 2019-12-27 百度在线网络技术(北京)有限公司 Training method and device of speech synthesis model, electronic equipment and storage medium
US11488577B2 (en) 2019-09-27 2022-11-01 Baidu Online Network Technology (Beijing) Co., Ltd. Training method and apparatus for a speech synthesis model, and storage medium
CN110619867B (en) * 2019-09-27 2020-11-03 百度在线网络技术(北京)有限公司 Training method and device of speech synthesis model, electronic equipment and storage medium
CN110880327A (en) * 2019-10-29 2020-03-13 平安科技(深圳)有限公司 Audio signal processing method and device
WO2021127821A1 (en) * 2019-12-23 2021-07-01 深圳市优必选科技股份有限公司 Speech synthesis model training method, apparatus, computer device, and storage medium
CN110956948A (en) * 2020-01-03 2020-04-03 北京海天瑞声科技股份有限公司 End-to-end speech synthesis method, device and storage medium
CN113314096A (en) * 2020-02-25 2021-08-27 阿里巴巴集团控股有限公司 Speech synthesis method, apparatus, device and storage medium
CN111785248B (en) * 2020-03-12 2023-06-23 北京汇钧科技有限公司 Text information processing method and device
CN111785248A (en) * 2020-03-12 2020-10-16 北京京东尚科信息技术有限公司 Text information processing method and device
CN111653265A (en) * 2020-04-26 2020-09-11 北京大米科技有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN111653265B (en) * 2020-04-26 2023-08-18 北京大米科技有限公司 Speech synthesis method, device, storage medium and electronic equipment
CN111583900B (en) * 2020-04-27 2022-01-07 北京字节跳动网络技术有限公司 Song synthesis method and device, readable medium and electronic equipment
CN111583900A (en) * 2020-04-27 2020-08-25 北京字节跳动网络技术有限公司 Song synthesis method and device, readable medium and electronic equipment
CN112151005B (en) * 2020-09-28 2022-08-19 四川长虹电器股份有限公司 Chinese and English mixed speech synthesis method and device
CN112151005A (en) * 2020-09-28 2020-12-29 四川长虹电器股份有限公司 Chinese and English mixed speech synthesis method and device
WO2022141710A1 (en) * 2020-12-28 2022-07-07 科大讯飞股份有限公司 Speech synthesis method, apparatus and device, and storage medium
CN113053355A (en) * 2021-03-17 2021-06-29 平安科技(深圳)有限公司 Fole human voice synthesis method, device, equipment and storage medium
CN113421547A (en) * 2021-06-03 2021-09-21 华为技术有限公司 Voice processing method and related equipment
CN114974208A (en) * 2022-06-20 2022-08-30 青岛大学 Chinese speech synthesis method and device, electronic equipment and storage medium
CN114974208B (en) * 2022-06-20 2024-05-31 青岛大学 Chinese speech synthesis method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109036377A (en) Speech synthesis method and device
JP7500020B2 (en) Multilingual text-to-speech synthesis method
Gold et al. Speech and audio signal processing: processing and perception of speech and music
Kawai et al. XIMERA: A new TTS from ATR based on corpus-based technologies
CN108899009B (en) Chinese speech synthesis system based on phoneme
CN109313891B (en) System and method for speech synthesis
US5930755A (en) Utilization of a recorded sound sample as a voice source in a speech synthesizer
Govind et al. Expressive speech synthesis: a review
US20060259303A1 (en) Systems and methods for pitch smoothing for text-to-speech synthesis
US8775185B2 (en) Speech samples library for text-to-speech and methods and apparatus for generating and using same
Qian et al. A cross-language state sharing and mapping approach to bilingual (Mandarin–English) TTS
US20130262120A1 (en) Speech synthesis device and speech synthesis method
WO2023279976A1 (en) Speech synthesis method, apparatus, device, and storage medium
Kardava et al. Solving the problem of the accents for speech recognition systems
CN114255738A (en) Speech synthesis method, apparatus, medium, and electronic device
Ipsic et al. Croatian HMM-based speech synthesis
RU61924U1 (en) STATISTICAL SPEECH MODEL
Nthite et al. End-to-End Text-To-Speech synthesis for under resourced South African languages
Takaki et al. Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2012
Maia et al. An HMM-based Brazilian Portuguese speech synthesizer and its characteristics
Hinterleitner et al. Speech synthesis
Georgila 19 Speech Synthesis: State of the Art and Challenges for the Future
Karjalainen Review of speech synthesis technology
KR100608643B1 (en) Pitch modelling apparatus and method for voice synthesizing system
Näslund Simulating hypernasality with phonological features in Swedish TTS

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181218