CN113870838A - Voice synthesis method, device, equipment and medium

Info

Publication number
CN113870838A
Authority
CN
China
Prior art keywords
phoneme
information
text information
target text
speech
Prior art date
Legal status
Pending
Application number
CN202111138183.XA
Other languages
Chinese (zh)
Inventor
张旭龙
王健宗
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202111138183.XA
Publication of CN113870838A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Abstract

The application relates to artificial intelligence technology and provides a speech synthesis method, apparatus, device, and medium. The method comprises the following steps: performing text parsing on target text information through a trained speech synthesis model to obtain a phoneme sequence of the target text information; performing phoneme mapping coding on each first phoneme in the phoneme sequence to obtain the coding information of each first phoneme; performing sequence extension on the coding information of each first phoneme according to the phoneme duration of that phoneme to obtain its extended coding information; predicting the acoustic features of the current frame from the extended coding information of each first phoneme and the acoustic features of the previous frame; and synthesizing the predicted speech information corresponding to the target text information from the acquired acoustic features of all frames in the speech to be synthesized. By aligning the phoneme sequence and the acoustic features in time, the method improves alignment accuracy and, in turn, the reliability of speech synthesis.

Description

Voice synthesis method, device, equipment and medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a medium for speech synthesis.
Background
Conventional speech synthesis models usually rely on an attention mechanism to align phoneme sequences and acoustic features in time; for example, sequence-to-sequence Text-To-Speech (TTS) models perform this alignment with attention. Attention-based alignment, however, is prone to errors, resulting in low reliability of the synthesized speech information.
Disclosure of Invention
The embodiments of the present application provide a speech synthesis method, apparatus, device, and medium that align a phoneme sequence and acoustic features in time, improving alignment accuracy and, in turn, the reliability of speech synthesis.
In one aspect, an embodiment of the present application provides a speech synthesis method, where the method includes:
inputting the target text information into the trained voice synthesis model, and performing text analysis on the target text information through the trained voice synthesis model to obtain a phoneme sequence of the target text information;
carrying out phoneme mapping coding on each first phoneme in the phoneme sequence of the target text information to obtain coding information of each first phoneme;
performing sequence extension on the coding information of each first phoneme according to the phoneme duration of each first phoneme to obtain the extension coding information of each first phoneme;
predicting the acoustic features of the current frame according to the extended coding information of each first phoneme and the acoustic features of the previous frame in the speech to be synthesized corresponding to the target text information until the acoustic features of all frames in the speech to be synthesized corresponding to the target text information are obtained;
and synthesizing the predicted voice information corresponding to the target text information according to the acoustic characteristics of all frames in the voice to be synthesized corresponding to the acquired target text information.
In an embodiment, the specific implementation process of performing sequence extension on the coding information of each first phoneme according to the phoneme duration of each first phoneme to obtain the extended coding information of each first phoneme is as follows:
acquiring a sampling rate of the voice to be synthesized corresponding to the target text information;
and performing sequence extension on the coding information of each first phoneme according to the phoneme duration of each first phoneme and the sampling rate to obtain the extended coding information of each first phoneme.
In an embodiment, the specific implementation process of performing sequence extension on the coding information of each first phoneme according to the phoneme duration of each first phoneme and the sampling rate to obtain the extended coding information of each first phoneme is as follows:
multiplying the sampling rate by the phoneme duration of each first phoneme to obtain an expansion factor of each first phoneme;
and replicating the coding information of each first phoneme a number of times equal to its expansion factor to form the extended coding information of that first phoneme, as in the sketch below.
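To make the sequence extension concrete, the following is a minimal Python sketch of the scheme just described; the function name and the rounding of the expansion factor to an integer count are illustrative assumptions, not part of the claims.

```python
from typing import List


def extend_encoding(encoding: int, duration_s: float, sample_rate_hz: int) -> List[int]:
    """Replicate a phoneme's coding information by its expansion factor
    (sampling rate multiplied by phoneme duration). Rounding the factor
    to an integer count is an illustrative assumption."""
    expansion_factor = round(sample_rate_hz * duration_s)
    return [encoding] * expansion_factor


# With an 8000 Hz sampling rate and a 0.3 s phoneme duration, the code 0
# of a phoneme is expanded into 2400 zeros, as in the description below.
assert len(extend_encoding(0, 0.3, 8000)) == 2400
```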
In one embodiment, the trained speech synthesis model includes a phoneme duration prediction model;
before performing sequence extension on the coding information of each first phoneme according to the phoneme duration of each first phoneme to obtain the extended coding information of each first phoneme, the following process may be further implemented:
and inputting the phoneme sequence of the target text information into the phoneme duration prediction model, and performing phoneme duration prediction on each first phoneme in the phoneme sequence of the target text information through the phoneme duration prediction model to obtain the phoneme duration of each first phoneme.
In an embodiment, predicting the acoustic features of the current frame according to the extension coding information of each first phoneme and the acoustic features of the previous frame in the speech to be synthesized corresponding to the target text information, until the acoustic features of all frames in the speech to be synthesized corresponding to the target text information are obtained, specifically implementing the following steps:
predicting the acoustic characteristics of the current frame according to the extended coding information of each first phoneme and the acoustic characteristics of the previous frame in the speech to be synthesized corresponding to the target text information;
if a termination identifier for indicating the end of the target text information is obtained, determining that the acoustic features of all frames in the speech to be synthesized corresponding to the target text information have been obtained;
if the termination identifier for indicating the end of the target text information is not obtained, predicting the acoustic feature of the next frame according to the extended coding information of each first phoneme and the acoustic feature of the current frame in the speech to be synthesized corresponding to the target text information until the termination identifier for indicating the end of the target text information is obtained.
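A minimal sketch of this frame-by-frame prediction loop follows. The decoder interface (a callable returning the next frame's acoustic features and a stop flag), the all-zero initial frame, and the frame cap are assumptions made for illustration.

```python
import numpy as np


def autoregressive_decode(decoder, extended_encodings, n_mels=80, max_frames=10000):
    """Predict acoustic features frame by frame until the termination
    identifier for the end of the target text information is produced."""
    frames = []
    previous_frame = np.zeros(n_mels)  # assumed all-zero "go" frame
    for _ in range(max_frames):
        current_frame, stop_flag = decoder(extended_encodings, previous_frame)
        frames.append(current_frame)
        if stop_flag:  # termination identifier reached: all frames obtained
            break
        previous_frame = current_frame  # the current frame becomes the previous frame
    return np.stack(frames)
```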
In one embodiment, the trained speech synthesis model includes a target encoder and a target decoder;
the specific implementation process of performing phoneme mapping coding on each first phoneme in the phoneme sequence of the target text information to obtain the coding information of each first phoneme is as follows:
performing phoneme mapping coding on each first phoneme in a phoneme sequence of the target text information through a target coder to obtain coding information of each first phoneme;
synthesizing predicted voice information corresponding to the target text information according to the acoustic characteristics of all frames in the voice to be synthesized corresponding to the acquired target text information, wherein the method comprises the following steps:
and performing voice synthesis on the acoustic features of all frames in the voice to be synthesized corresponding to the obtained target text information through a target decoder to obtain predicted voice information corresponding to the target text information.
In one embodiment, the specific implementation process of the training method of the speech synthesis model is as follows:
acquiring a training sample, wherein the training sample comprises training text information and training voice information corresponding to the training text information;
performing text analysis on the training text information through a speech synthesis model to obtain a phoneme sequence of the training text information;
performing phoneme mapping coding on each second phoneme in the phoneme sequence of the training text information to obtain coding information of each second phoneme;
performing sequence extension on the coding information of each second phoneme according to the phoneme duration of each second phoneme to obtain the extension coding information of each second phoneme;
predicting the acoustic features of the current frame according to the extended coding information of each second phoneme and the acoustic features of the previous frame in the speech to be synthesized corresponding to the training text information until the acoustic features of all frames in the speech to be synthesized corresponding to the training text information are obtained;
synthesizing predicted voice information corresponding to the training text information according to the acoustic characteristics of all frames in the voice to be synthesized corresponding to the acquired training text information;
and training the voice synthesis model according to the predicted voice information and the training voice information corresponding to the training text information to obtain the trained voice synthesis model.
In another aspect, an embodiment of the present application provides a speech synthesis apparatus, including:
the input unit is used for inputting the target text information into the trained speech synthesis model;
the processing unit is used for carrying out text analysis on the target text information through the trained speech synthesis model to obtain a phoneme sequence of the target text information;
the processing unit is further used for carrying out phoneme mapping coding on each first phoneme in the phoneme sequence of the target text information to obtain coding information of each first phoneme;
the processing unit is further used for performing sequence extension on the coding information of each first phoneme according to the phoneme duration of each first phoneme to obtain the extended coding information of each first phoneme;
the processing unit is further configured to predict the acoustic features of the current frame according to the extended coding information of each first phoneme and the acoustic features of the previous frame in the speech to be synthesized corresponding to the target text information until the acoustic features of all frames in the speech to be synthesized corresponding to the target text information are obtained;
and the processing unit is further used for synthesizing the predicted voice information corresponding to the target text information according to the acoustic features of all frames in the voice to be synthesized corresponding to the obtained target text information.
In another aspect, an embodiment of the present application provides an electronic device, which includes a processor, a memory, and a communication interface, where the processor, the memory, and the communication interface are connected to each other, where the memory is used to store a computer program that supports a terminal to execute the foregoing method, the computer program includes program instructions, and the processor is configured to call the program instructions, and perform the following steps: inputting the target text information into the trained voice synthesis model, and performing text analysis on the target text information through the trained voice synthesis model to obtain a phoneme sequence of the target text information; carrying out phoneme mapping coding on each first phoneme in the phoneme sequence of the target text information to obtain coding information of each first phoneme; performing sequence extension on the coding information of each first phoneme according to the phoneme duration of each first phoneme to obtain the extension coding information of each first phoneme; predicting the acoustic features of the current frame according to the extended coding information of each first phoneme and the acoustic features of the previous frame in the speech to be synthesized corresponding to the target text information until the acoustic features of all frames in the speech to be synthesized corresponding to the target text information are obtained; and synthesizing the predicted voice information corresponding to the target text information according to the acoustic characteristics of all frames in the voice to be synthesized corresponding to the acquired target text information.
In yet another aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program, the computer program comprising program instructions that, when executed by a processor, cause the processor to execute the above-mentioned speech synthesis method.
In the embodiment of the application, after the phoneme sequence of the target text information is obtained, phoneme mapping coding is performed on each first phoneme in the phoneme sequence of the target text information to obtain coding information of each first phoneme, sequence extension is performed on the coding information of each first phoneme according to the phoneme duration of each first phoneme to obtain extension coding information of each first phoneme, the acoustic features of a current frame are predicted according to the extension coding information of each first phoneme and the acoustic features of a previous frame in the speech to be synthesized corresponding to the target text information until the acoustic features of all frames in the speech to be synthesized corresponding to the target text information are obtained, time alignment can be performed on the phoneme sequence and the acoustic features, and alignment accuracy is improved. Furthermore, according to the acoustic characteristics of all frames in the speech to be synthesized corresponding to the acquired target text information, the predicted speech information corresponding to the target text information is synthesized, and the reliability of speech synthesis can be improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic architecture diagram of a speech synthesis system according to an embodiment of the present application;
Fig. 2 is a schematic flowchart of a speech synthesis method provided in an embodiment of the present application;
Fig. 3 is a flowchart illustrating a method for training a speech synthesis model according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
According to the method and device of the present application, the predicted phoneme durations of the phoneme sequence guarantee the time alignment between the acoustic features and the phoneme sequence. This overcomes the instability of attention-based alignment schemes, avoids problems such as missing or repeated words in the synthesized speech, and further improves the reliability of speech synthesis.
The speech synthesis method in the embodiments of the present application can be applied to a first electronic device, where the first electronic device may be any one or more of a smart phone, a tablet computer, a notebook computer, a desktop computer, an intelligent vehicle-mounted device, and an intelligent wearable device.
In one example, the first electronic device runs a reading client that provides a read-aloud function. If a user submits a read-aloud instruction for a certain piece of text information (such as a novel or a poem), the first electronic device may acquire the text information after detecting the instruction and perform the speech synthesis method disclosed in the embodiments of the present application.
In another example, an instant messaging client runs on the first electronic device. In scenarios where it is inconvenient for the user to look at the device, such as while driving or in a bumpy environment, a session interface in the instant messaging client may contain at least one piece of text information. If the user needs a certain piece of text information converted into speech, the user can submit a speech conversion instruction for that text information; after detecting the instruction, the first electronic device can acquire the text information and execute the speech synthesis method disclosed in the embodiments of the present application.
In another example, the first electronic device runs an intelligent customer service client. When a user interacts with the intelligent customer service client and submits interaction information (whose type may be text or voice) through the first electronic device, the client may determine, based on the interaction information, the text information to be output to the user and execute the speech synthesis method disclosed in the embodiments of the present application.
The embodiments of the present application can acquire and process the relevant data based on artificial intelligence technology. Artificial Intelligence (AI) refers to the theories, methods, techniques, and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
Referring to fig. 1, fig. 1 is a schematic diagram of the architecture of a speech synthesis system according to an embodiment of the present application. As shown in fig. 1, the overall flow of the system starts from the text input requiring speech synthesis. Taking the text information requiring speech synthesis as the target text information, the target text information is parsed to obtain a phoneme sequence, and each phoneme in the phoneme sequence is processed as follows: a duration-extended coding representation of each phoneme is obtained according to the pre-trained phoneme duration prediction model and used as the input of a decoder. The decoder performs sequence-to-sequence autoregressive decoding, so the frame-level acoustic features output at the previous moment are also used as decoder input; the decoder outputs the predicted acoustic features and judges whether the terminator marking the end of the target text information has been reached. If the terminator has been reached, all acoustic features are input into the vocoder to synthesize the corresponding waveform audio, i.e., the predicted speech information corresponding to the target text information. If the terminator has not been reached, the frame-level acoustic features output at the previous moment and the duration-extended coding representation of the phoneme at the current moment are used as decoder input, and the decoder continues to predict acoustic features until the terminator is reached.
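Read as pseudocode, the flow of fig. 1 can be summarized as below. Every name here (text_to_phonemes, duration_model, encode_phoneme, decoder, vocoder) is a stand-in for a module described above rather than an interface defined by the application, and the loop reuses the autoregressive_decode sketch given earlier.

```python
def synthesize(target_text, text_to_phonemes, duration_model, encode_phoneme,
               decoder, vocoder, sample_rate_hz=8000):
    """End-to-end flow of fig. 1 with stand-in component names."""
    phonemes = text_to_phonemes(target_text)            # text parsing
    durations = duration_model(phonemes)                # pre-trained duration prediction
    extended = []                                       # duration-extended coding representation
    for phoneme, duration in zip(phonemes, durations):
        factor = round(sample_rate_hz * duration)
        extended.extend([encode_phoneme(phoneme)] * factor)
    frames = autoregressive_decode(decoder, extended)   # sequence-to-sequence decoding
    return vocoder(frames)                              # corresponding waveform audio
```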
As a possible implementation, the decoder of the speech synthesis system is a Tacotron decoder, but no Tacotron encoder is used; that is, the encoder in the speech synthesis system differs from the Tacotron encoder. Because the system does not use a Tacotron encoder, repeated frame-level features in the learned hidden variables are prevented from being fed into the Tacotron decoder; the estimated phoneme durations guarantee the time alignment, avoiding the erroneous alignment of the position-based attention mechanism between the Tacotron encoder and the Tacotron decoder.
Referring to fig. 2, fig. 2 is a schematic flowchart of a speech synthesis method according to an embodiment of the present application. The speech synthesis method shown in fig. 2 may be performed by a first electronic device, and the scheme includes, but is not limited to, steps S201 to S205:
s201, inputting the target text information into the trained voice synthesis model, and performing text analysis on the target text information through the trained voice synthesis model to obtain a phoneme sequence of the target text information.
The trained speech synthesis model may include a grapheme-to-phoneme (G2P) module. The first electronic device may perform word segmentation on the target text information through the trained speech synthesis model to obtain a text character string, and then convert the text character string into a phoneme sequence through the G2P module.
The phoneme sequence may comprise one or more phonemes (phones). A phoneme is the smallest phonetic unit divided according to the natural attributes of speech: in terms of acoustic properties, it is the smallest unit divided by sound quality; in terms of physiological properties, one pronunciation action forms one phoneme. For example, "ma" includes the two pronunciation actions "m" and "a" and therefore comprises two phonemes. Sounds produced by the same pronunciation action are the same phoneme, and sounds produced by different pronunciation actions are different phonemes. For example, in "ma-mi", the two "m" pronunciation actions are the same and correspond to the same phoneme, while "a" and "i" are produced by different pronunciation actions and are different phonemes. Phonemes are generally analyzed in terms of pronunciation actions; the pronunciation action for "m", for instance, is: the upper and lower lips close, the vocal cords vibrate, and the airflow exits through the nasal cavity to make the sound.
For example, if the target text information is "wish you a happy birthday", the first electronic device performs text parsing on the target text information through the trained speech synthesis model and obtains the phoneme sequence "zhunishengrikuaile", which contains 18 phonemes.
The G2P module can use a recurrent neural network (RNN) and a long short-term memory (LSTM) network to realize the conversion from English words to phonemes; a toy illustration of the conversion is sketched below.
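The sketch maps pinyin syllables to phonemes with a hand-written dictionary; a production G2P module would use the RNN/LSTM described above, and the dictionary entries here are assumptions made for the example.

```python
# Toy dictionary-based G2P; entries are illustrative assumptions.
PINYIN_TO_PHONEMES = {
    "zhu": ["z", "h", "u"], "ni": ["n", "i"], "sheng": ["s", "h", "e", "n", "g"],
    "ri": ["r", "i"], "kuai": ["k", "u", "a", "i"], "le": ["l", "e"],
}


def g2p(syllables):
    """Convert a list of pinyin syllables into a flat phoneme sequence."""
    phonemes = []
    for syllable in syllables:
        phonemes.extend(PINYIN_TO_PHONEMES[syllable])
    return phonemes


# "zhu ni sheng ri kuai le" -> 18 phonemes, matching the example above.
assert len(g2p(["zhu", "ni", "sheng", "ri", "kuai", "le"])) == 18
```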
S202, performing phoneme mapping coding on each first phoneme in the phoneme sequence of the target text information to obtain coding information of each first phoneme.
Any phoneme included in the phoneme sequence of the target text information is a first phoneme, and the first electronic device may perform phoneme mapping coding on each first phoneme in the phoneme sequence of the target text information through a phoneme encoder to obtain the coding information of each first phoneme. The phoneme encoder can encode the phoneme sequence according to the cycle parameter, the amplitude parameter, and the spectrum parameter to obtain the coding information of each first phoneme.
Compared with the traditional Tacotron model, the trained speech synthesis model in the embodiment of the present application does not use a Tacotron encoder; it directly performs phoneme mapping coding on each phoneme in the phoneme sequence to obtain the coding information. For example, the phoneme "a" can be encoded as the integer 0: the first electronic device performs phoneme mapping coding on the phoneme "a" in the phoneme sequence through the trained speech synthesis model and obtains the coding information "0".
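A minimal sketch of this direct phoneme mapping coding, assuming a fixed phoneme inventory in which "a" happens to map to the integer 0:

```python
# Minimal phoneme mapping coding: each phoneme maps to one integer.
# The inventory below is an illustrative assumption.
PHONEME_INVENTORY = ["a", "e", "g", "h", "i", "k", "l", "n", "r", "s", "u", "z"]
PHONEME_TO_ID = {p: i for i, p in enumerate(PHONEME_INVENTORY)}


def map_encode(phonemes):
    """Map each phoneme in the sequence to its coding information."""
    return [PHONEME_TO_ID[p] for p in phonemes]


print(map_encode(["a"]))  # [0]: the phoneme "a" encodes to 0, as above
```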
S203, according to the phoneme duration of each first phoneme, performing sequence extension on the coding information of each first phoneme to obtain the extension coding information of each first phoneme.
In a possible embodiment, the trained speech synthesis model may include a phoneme duration prediction model, and the first electronic device may perform phoneme duration prediction on each phoneme through the phoneme duration prediction model to obtain a phoneme duration of each phoneme. The phoneme duration prediction model has the function of inputting a phoneme sequence and outputting a duration estimation value (namely phoneme duration) corresponding to each phoneme in the phoneme sequence. The phoneme duration prediction model is pre-trained, and is not trained in the training process of the speech synthesis model.
The first electronic device may perform sequence extension on the coding information of each phoneme through the trained speech synthesis model to obtain the extended coding information of each phoneme. For example, the first electronic device may obtain the sampling rate of the speech to be synthesized corresponding to the text information; if the sampling rate is 8000 hertz (Hz) and the duration of the phoneme "a" is predicted to be 0.3 seconds, the first electronic device expands the code 0 of the phoneme "a" into 2400 zeros, that is, the extended coding information of the phoneme "a" consists of 2400 zeros.
The purpose of phoneme duration prediction is to temporally align the input linguistic-feature phoneme sequence with the acoustic-feature spectrum. For example, the acoustic features often span as many as 200 frames while the corresponding phoneme sequence may contain only 10 phonemes; the alignment is completed by using the duration predictions to extend the 10 phonemes to a total size of 200. Accordingly, the number of frames corresponding to each phoneme differs.
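A minimal sketch of such a phoneme duration prediction model is given below in PyTorch. The application does not fix the architecture, so the embedding/LSTM/linear layout and all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn


class DurationPredictor(nn.Module):
    """Maps a phoneme-id sequence to one duration estimate (in seconds)
    per phoneme; architecture and sizes are illustrative assumptions."""

    def __init__(self, n_phonemes=64, embed_dim=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, 1)

    def forward(self, phoneme_ids):
        # phoneme_ids: (batch, seq_len) -> durations: (batch, seq_len)
        x = self.embed(phoneme_ids)
        h, _ = self.lstm(x)
        return nn.functional.softplus(self.proj(h)).squeeze(-1)  # positive durations


durations = DurationPredictor()(torch.randint(0, 64, (1, 18)))  # one duration per phoneme
```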
And S204, predicting the acoustic features of the current frame according to the extension coding information of each first phoneme and the acoustic features of the previous frame in the speech to be synthesized corresponding to the target text information until the acoustic features of all frames in the speech to be synthesized corresponding to the target text information are obtained.
In one possible embodiment, the decoder in the trained speech synthesis model may be a Tacotron decoder from a conventional Tacotron model, such as an RNN-based decoder. In the embodiment of the present application, the Tacotron decoder performs sequence-to-sequence autoregressive decoding: the frame-level acoustic features output at the previous moment (i.e., the acoustic features of the previous frame) are also used as input to the Tacotron decoder, and the decoder outputs the predicted acoustic features of the current frame together with whether the terminator ending the text information has been reached. The first electronic device may predict the acoustic features of the current frame through the decoder from the extended coding information of each phoneme and the acoustic features of the previous frame in the speech to be synthesized corresponding to the text information, so that the phoneme sequence and the acoustic-feature spectrum are aligned in time. Here, the current frame is the frame following the frame to which the most recently acquired acoustic features belong. For example, if the first electronic device most recently acquired the acoustic features of the fifth frame through the decoder, the decoder predicts the acoustic features of the sixth frame from the extended coding information of each phoneme and the acoustic features of the fifth frame.
Here, the acoustic feature refers to a physical quantity representing acoustic characteristics of speech, and is also a generic term for acoustic representation of sound elements. Such as energy concentration zones representing timbre, formant frequency, formant intensity and bandwidth, and duration, fundamental frequency, average speech power, etc., representing speech prosodic characteristics.
After the phoneme sequence is obtained, the phoneme duration of each phoneme in the sequence is predicted and each phoneme's coding is sequence-extended according to that duration in order to time-align the input linguistic-feature phoneme sequence with the acoustic-feature spectrum. For example, where the acoustic features span 200 frames and the phoneme sequence contains 10 phonemes, the duration predictions extend the 10 phonemes to a total size of 200 to complete the alignment. Since the phoneme durations differ from one another, the number of frames corresponding to each phoneme differs as well.
And S205, synthesizing predicted voice information corresponding to the target text information according to the acoustic features of all frames in the voice to be synthesized corresponding to the acquired target text information.
If the decoder determines that the terminator marking the end of the target text information has been reached, it can be determined that the acoustic features of all frames in the speech to be synthesized have been obtained. The first electronic device may then synthesize the predicted speech information corresponding to the target text information from the acoustic features of all frames; for example, the decoder may input the acoustic features of all frames to the vocoder, and the corresponding waveform audio, i.e., the predicted speech information, is synthesized through the vocoder.
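The application does not specify a particular vocoder. As a stand-in, the sketch below inverts predicted mel-spectrogram frames to waveform audio with librosa's Griffin-Lim-based mel inversion; the mel representation, sample rate, and FFT parameters are all assumptions.

```python
import numpy as np
import librosa

# Placeholder for decoder output: 200 frames of 80-dimensional acoustic
# features (random values stand in for predicted mel-spectrogram frames).
frames = np.abs(np.random.randn(200, 80)).astype(np.float32)

# Stand-in vocoder: Griffin-Lim mel inversion (not the patent's vocoder).
audio = librosa.feature.inverse.mel_to_audio(
    frames.T, sr=8000, n_fft=1024, hop_length=256, n_iter=32
)
```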
In the embodiment of the application, after the phoneme sequence of the target text information is obtained, phoneme mapping coding is performed on each first phoneme in the phoneme sequence of the target text information to obtain coding information of each first phoneme, sequence extension is performed on the coding information of each first phoneme according to the phoneme duration of each first phoneme to obtain extension coding information of each first phoneme, the acoustic features of a current frame are predicted according to the extension coding information of each first phoneme and the acoustic features of a previous frame in the speech to be synthesized corresponding to the target text information until the acoustic features of all frames in the speech to be synthesized corresponding to the target text information are obtained, time alignment can be performed on the phoneme sequence and the acoustic features, and alignment accuracy is improved. Furthermore, according to the acoustic characteristics of all frames in the speech to be synthesized corresponding to the acquired target text information, the predicted speech information corresponding to the target text information is synthesized, and the reliability of speech synthesis can be improved.
Referring to fig. 3, fig. 3 is a schematic flowchart of a method for training a speech synthesis model according to an embodiment of the present application. The training method shown in fig. 3 may be performed by a second electronic device and includes, but is not limited to, steps S301 to S307:
s301, training samples are obtained, and the training samples comprise training text information and training voice information corresponding to the training text information.
The second electronic device may obtain a training sample, where the training speech information in the training sample may be a single piece of speech, that is, audio data input by one user. For example, if a user inputs audio data of "wish you a happy birthday", the second electronic device may use that audio data as the training speech information, and the corresponding training text information is "wish you a happy birthday".
It is understood that the training sample may be input to the second electronic device by the user, for example, the second electronic device collects training voice information through a microphone, and collects training text information corresponding to the training voice information through an input device (e.g., a touch panel or a keyboard) of the second electronic device. Optionally, the training sample may also be obtained by the second electronic device from a local storage, or obtained by the second electronic device from another device, or obtained by downloading the second electronic device through the internet, which is not limited by the embodiment of the present application.
The second electronic device can be any one or more of a smart phone, a tablet computer, a notebook computer, a desktop computer, an intelligent vehicle-mounted device and an intelligent wearable device. Optionally, the second electronic device may also be a server, and the server may be an independent physical server, or a server cluster or a distributed system formed by a plurality of physical servers. That is, the server may be an independent server, or may be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a web service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), and a big data and artificial intelligence platform.
S302, performing text analysis on the training text information through the speech synthesis model to obtain a phoneme sequence of the training text information.
And S303, performing phoneme mapping coding on each second phoneme in the phoneme sequence of the training text information to obtain coding information of each second phoneme.
Any phoneme included in the phoneme sequence of the training text information is a second phoneme, and a manner of performing phoneme mapping coding on each second phoneme in the phoneme sequence of the training text information in the embodiment of the present application is the same as a manner of performing phoneme mapping coding on each first phoneme in the phoneme sequence of the target text information by the first electronic device, which may specifically refer to the description in step S202, and is not described in detail in the embodiment of the present application.
S304, according to the phoneme duration of each second phoneme, performing sequence extension on the coding information of each second phoneme to obtain the extension coding information of each second phoneme.
The second electronic device may perform sequence extension on the coding information of each second phoneme according to the phoneme duration of each second phoneme as follows: the second electronic device determines the sampling rate, i.e., how many sampling points per second represent the time-domain waveform of the audio, from the speech spectrum of the training speech information in the training sample. The second electronic device may then multiply the sampling rate by the phoneme duration of each second phoneme to obtain the expansion factor of that phoneme, and replicate the code of each second phoneme a number of times equal to its expansion factor to form the extended coding information of that phoneme. Training samples for speech synthesis are usually resampled to a uniform sampling rate of 8000 Hz or 16000 Hz.
S305, predicting the acoustic features of the current frame according to the extended coding information of each second phoneme and the acoustic features of the previous frame in the speech to be synthesized corresponding to the training text information until the acoustic features of all frames in the speech to be synthesized corresponding to the training text information are obtained.
In the embodiment of the present application, according to the extended coding information of each second phoneme and the acoustic feature of the previous frame in the speech to be synthesized corresponding to the training text information, the manner of predicting the acoustic feature of the current frame is the same as the manner of predicting the acoustic feature of the current frame by the first electronic device according to the extended coding information of each first phoneme and the acoustic feature of the previous frame in the speech to be synthesized corresponding to the target text information, which may specifically refer to the description in step S204, and this embodiment is not described again.
And S306, synthesizing the predicted voice information corresponding to the training text information according to the acquired acoustic features of all frames in the voice to be synthesized corresponding to the training text information.
According to the acoustic features of all frames in the speech to be synthesized corresponding to the obtained training text information, the manner of synthesizing the predicted speech information corresponding to the training text information is the same as the manner of synthesizing the predicted speech information corresponding to the target text information by the first electronic device according to the acoustic features of all frames in the speech to be synthesized corresponding to the obtained target text information, which may be specifically referred to the description of step S205, and is not repeated in this embodiment.
S307, training the voice synthesis model according to the predicted voice information and the training voice information corresponding to the training text information to obtain the trained voice synthesis model.
In a specific implementation, the second electronic device may compare the synthesized speech information with speech information in a training sample to obtain a loss value, and train the speech synthesis model according to the loss value to obtain a trained speech synthesis model.
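A minimal sketch of one such training step, assuming mel-spectrogram acoustic features, an L2 loss, and a PyTorch model/optimizer; the application does not fix the loss function or the optimizer.

```python
import torch


def training_step(model, optimizer, training_text, training_mel):
    """Synthesize, compare with the training speech information, and update.
    The MSE loss between predicted and ground-truth features is an assumed
    choice; `model` and `optimizer` are stand-in PyTorch objects."""
    predicted_mel = model(training_text)  # acoustic features of all frames
    loss = torch.nn.functional.mse_loss(predicted_mel, training_mel)  # loss value
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```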
In the embodiment of the application, a speech synthesis model is used for performing text parsing on training text information to obtain a phoneme sequence of the training text information, phoneme mapping coding is performed on each second phoneme in the phoneme sequence of the training text information to obtain coding information of each second phoneme, sequence expansion is performed on the coding information of each second phoneme according to the phoneme duration of each second phoneme to obtain expanded coding information of each second phoneme, the acoustic features of a current frame are predicted according to the expanded coding information of each second phoneme and the acoustic features of a previous frame in the speech to be synthesized corresponding to the training text information until the acoustic features of all frames in the speech to be synthesized corresponding to the training text information are obtained, and the predicted speech information corresponding to the training text information is synthesized according to the acoustic features of all frames in the speech to be synthesized corresponding to the training text information, and training the voice synthesis model according to the predicted voice information and the training voice information corresponding to the training text information to obtain the trained voice synthesis model, and performing time alignment on the phoneme sequence and the acoustic feature through the trained voice synthesis model to improve the alignment accuracy and further improve the reliability of voice synthesis.
The embodiment of the present application further provides a computer storage medium, in which program instructions are stored, and when the program instructions are executed, the computer storage medium is used for implementing the corresponding method described in the above embodiment.
Referring to fig. 4 again, fig. 4 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present application.
In one implementation of the apparatus of the embodiment of the application, the apparatus includes the following structure.
An input unit 401 configured to input target text information to the trained speech synthesis model;
a processing unit 402, configured to perform text parsing on the target text information through the trained speech synthesis model to obtain a phoneme sequence of the target text information;
the processing unit 402 is further configured to perform phoneme mapping coding on each first phoneme in the phoneme sequence of the target text information to obtain coding information of each first phoneme;
the processing unit 402 is further configured to perform sequence extension on the coding information of each first phoneme according to the phoneme duration of each first phoneme to obtain extended coding information of each first phoneme;
the processing unit 402 is further configured to predict the acoustic features of the current frame according to the extension coding information of each first phoneme and the acoustic features of the previous frame in the speech to be synthesized corresponding to the target text information until the acoustic features of all frames in the speech to be synthesized corresponding to the target text information are obtained;
the processing unit 402 is further configured to synthesize predicted speech information corresponding to the target text information according to the acoustic features of all frames in the speech to be synthesized corresponding to the obtained target text information.
In one embodiment, the sequence extension of the coding information of each first phoneme according to the phoneme duration of each first phoneme by the processing unit 402 to obtain the extended coding information of each first phoneme includes:
acquiring a sampling rate of the voice to be synthesized corresponding to the target text information;
and performing sequence extension on the coding information of each first phoneme according to the phoneme duration and the sampling rate of each first phoneme to obtain the extended coding information of each first phoneme.
In one implementation, the processing unit 402 performs sequence extension on the coding information of each first phoneme according to the phoneme duration and the sampling rate of each first phoneme to obtain extended coding information of each first phoneme, including:
multiplying the sampling rate by the phoneme duration of each first phoneme to obtain an expansion factor of each first phoneme;
and replicating the coding information of each first phoneme a number of times equal to its expansion factor to form the extended coding information of that first phoneme.
In one embodiment, the trained speech synthesis model includes a phoneme duration prediction model;
before the processing unit 402 performs sequence extension on the coding information of each first phoneme according to the phoneme duration of each first phoneme to obtain the extended coding information of each first phoneme, the method further includes:
inputting the phoneme sequence of the target text information into a phoneme duration prediction model, and performing phoneme duration prediction on each first phoneme in the phoneme sequence of the target text information through the phoneme duration prediction model to obtain the phoneme duration of each first phoneme.
In an embodiment, the predicting, by the processing unit 402, the acoustic features of the current frame according to the extension coding information of each first phoneme and the acoustic features of the previous frame in the speech to be synthesized corresponding to the target text information until the acoustic features of all frames in the speech to be synthesized corresponding to the target text information are obtained includes:
predicting the acoustic characteristics of the current frame according to the extended coding information of each first phoneme and the acoustic characteristics of the previous frame in the speech to be synthesized corresponding to the target text information;
if a termination identifier for indicating the end of the target text information is obtained, determining to obtain the acoustic characteristics of all frames in the voice to be synthesized corresponding to the target text information;
if the termination identifier for indicating the end of the target text information is not obtained, predicting the acoustic feature of the next frame according to the extended coding information of each first phoneme and the acoustic feature of the current frame in the speech to be synthesized corresponding to the target text information until the termination identifier for indicating the end of the target text information is obtained.
In one embodiment, the trained speech synthesis model includes a target encoder and a target decoder;
the processing unit 402 performs phoneme mapping coding on each first phoneme in the phoneme sequence of the target text information to obtain coding information of each first phoneme, including:
performing phoneme mapping coding on each first phoneme in a phoneme sequence of the target text information through a target coder to obtain coding information of each first phoneme;
the processing unit 402 synthesizes predicted speech information corresponding to the target text information according to the acoustic features of all frames in the speech to be synthesized corresponding to the acquired target text information, including:
and performing voice synthesis on the acoustic features of all frames in the voice to be synthesized corresponding to the obtained target text information through a target decoder to obtain predicted voice information corresponding to the target text information.
In one embodiment, the apparatus further comprises:
an obtaining unit 403, configured to obtain a training sample, where the training sample includes training text information and training speech information corresponding to the training text information;
the processing unit 402 is further configured to perform text parsing on the training text information through a speech synthesis model to obtain a phoneme sequence of the training text information;
the processing unit 402 is further configured to perform phoneme mapping coding on each second phoneme in the phoneme sequence of the training text information to obtain coding information of each second phoneme;
the processing unit 402 is further configured to perform sequence extension on the coding information of each second phoneme according to the phoneme duration of each second phoneme to obtain extended coding information of each second phoneme;
the processing unit 402 is further configured to predict the acoustic features of the current frame according to the extended coding information of each second phoneme and the acoustic features of the previous frame in the speech to be synthesized corresponding to the training text information until the acoustic features of all frames in the speech to be synthesized corresponding to the training text information are obtained;
the processing unit 402 is further configured to synthesize predicted speech information corresponding to the training text information according to the acoustic features of all frames in the speech to be synthesized corresponding to the acquired training text information;
the processing unit 402 is further configured to train the speech synthesis model according to the predicted speech information and the training speech information corresponding to the training text information, so as to obtain a trained speech synthesis model.
In the embodiment of the application, after the phoneme sequence of the target text information is obtained, phoneme mapping coding is performed on each first phoneme in the phoneme sequence of the target text information to obtain coding information of each first phoneme, sequence extension is performed on the coding information of each first phoneme according to the phoneme duration of each first phoneme to obtain extension coding information of each first phoneme, the acoustic features of a current frame are predicted according to the extension coding information of each first phoneme and the acoustic features of a previous frame in the speech to be synthesized corresponding to the target text information until the acoustic features of all frames in the speech to be synthesized corresponding to the target text information are obtained, time alignment can be performed on the phoneme sequence and the acoustic features, and alignment accuracy is improved. Furthermore, according to the acoustic characteristics of all frames in the speech to be synthesized corresponding to the acquired target text information, the predicted speech information corresponding to the target text information is synthesized, and the reliability of speech synthesis can be improved.
Referring to fig. 5, fig. 5 is a schematic structural diagram of an electronic device provided in an embodiment of the present application. The electronic device in the embodiment of the present application includes a power supply module and the like, as well as a processor 501, a memory 502, and a communication interface 503. Data can be exchanged among the processor 501, the memory 502, and the communication interface 503, and the processor 501 implements the corresponding data processing scheme.
Memory 502 may include volatile memory (volatile memory), such as random-access memory (RAM); the memory 502 may also include a non-volatile memory (non-volatile memory), such as a flash memory (flash memory), a solid-state drive (SSD), etc.; the memory 502 may also comprise a combination of memories of the kind described above.
The processor 501 may be a central processing unit (CPU), or a combination of a CPU and a GPU. The electronic device may include a plurality of CPUs and GPUs as necessary to perform the corresponding data processing. In one embodiment, the memory 502 is used to store program instructions, and the processor 501 may invoke these program instructions to implement the various methods described above in the embodiments of the present application.
In a first possible implementation, the processor 501 of the electronic device calls program instructions stored in the memory 502 for performing the following operations:
inputting target text information into the trained speech synthesis model;
performing text analysis on the target text information through the trained speech synthesis model to obtain a phoneme sequence of the target text information;
carrying out phoneme mapping coding on each first phoneme in the phoneme sequence of the target text information to obtain coding information of each first phoneme;
performing sequence extension on the coding information of each first phoneme according to the phoneme duration of each first phoneme to obtain the extension coding information of each first phoneme;
predicting the acoustic features of the current frame according to the extended coding information of each first phoneme and the acoustic features of the previous frame in the speech to be synthesized corresponding to the target text information until the acoustic features of all frames in the speech to be synthesized corresponding to the target text information are obtained;
and synthesizing the predicted voice information corresponding to the target text information according to the acoustic characteristics of all frames in the voice to be synthesized corresponding to the acquired target text information.
In an embodiment, the processor 501 is specifically configured to perform the following operations when performing sequence extension on the coding information of each first phoneme according to the phoneme duration of each first phoneme to obtain extended coding information of each first phoneme:
acquiring a sampling rate of the voice to be synthesized corresponding to the target text information;
and performing sequence extension on the coding information of each first phoneme according to the phoneme duration and the sampling rate of each first phoneme to obtain the extended coding information of each first phoneme.
In one implementation, the processor 501 is specifically configured to perform the following operations when performing sequence extension on the coding information of each first phoneme according to the phoneme duration and the sampling rate of each first phoneme to obtain extended coding information of each first phoneme:
multiplying the sampling rate by the phoneme duration of each first phoneme to obtain an expansion factor of each first phoneme;
and replicating the coding information of each first phoneme a number of times equal to its expansion factor to form the extended coding information of that first phoneme.
In one embodiment, the trained speech synthesis model includes a phoneme duration prediction model;
before the processor 501 performs sequence extension on the coding information of each first phoneme according to the phoneme duration of each first phoneme to obtain the extended coding information of each first phoneme, the following operations are further performed:
inputting the phoneme sequence of the target text information into a phoneme duration prediction model, and performing phoneme duration prediction on each first phoneme in the phoneme sequence of the target text information through the phoneme duration prediction model to obtain the phoneme duration of each first phoneme.
In an embodiment, the processor 501, when predicting the acoustic feature of the current frame according to the extension coding information of each first phoneme and the acoustic feature of the previous frame in the speech to be synthesized corresponding to the target text information until obtaining the acoustic features of all frames in the speech to be synthesized corresponding to the target text information, is specifically configured to perform the following operations:
predicting the acoustic characteristics of the current frame according to the extended coding information of each first phoneme and the acoustic characteristics of the previous frame in the speech to be synthesized corresponding to the target text information;
if a termination identifier for indicating the end of the target text information is obtained, determining to obtain the acoustic characteristics of all frames in the voice to be synthesized corresponding to the target text information;
if the termination identifier for indicating the end of the target text information is not obtained, predicting the acoustic feature of the next frame according to the extended coding information of each first phoneme and the acoustic feature of the current frame in the speech to be synthesized corresponding to the target text information until the termination identifier for indicating the end of the target text information is obtained.
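The sketch below illustrates this autoregressive loop with a termination identifier; `step` is a hypothetical stand-in for the trained predictor and returns a frame together with a stop flag, and the frame dimension of 80 is only an assumption.

```python
import numpy as np

def predict_all_frames(extended_codes, step, max_frames=1000, dim=80):
    frames, prev = [], np.zeros(dim)              # the first "previous frame" is empty
    for _ in range(max_frames):
        frame, stop = step(extended_codes, prev)  # predict the current frame
        frames.append(frame)
        if stop:                                  # termination identifier obtained:
            break                                 # all frames have been predicted
        prev = frame                              # the current frame conditions the next
    return np.stack(frames)

# Toy step function that emits random 80-dim frames and stops after 5 frames.
counter = {"n": 0}
def toy_step(codes, prev):
    counter["n"] += 1
    return np.random.rand(80), counter["n"] >= 5

print(predict_all_frames(None, toy_step).shape)   # (5, 80)
```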
In one embodiment, the trained speech synthesis model includes a target encoder and a target decoder;
when performing phoneme mapping coding on each first phoneme in the phoneme sequence of the target text information to obtain coding information of each first phoneme, the processor 501 is specifically configured to perform the following operations:
performing phoneme mapping coding on each first phoneme in a phoneme sequence of the target text information through a target coder to obtain coding information of each first phoneme;
when synthesizing the predicted speech information corresponding to the target text information according to the acoustic features of all frames in the speech to be synthesized corresponding to the acquired target text information, the processor 501 is specifically configured to perform the following operations:
and performing voice synthesis, through the target decoder, on the acquired acoustic features of all frames in the voice to be synthesized corresponding to the target text information to obtain the predicted voice information corresponding to the target text information.
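A minimal structural sketch of this encoder/decoder split follows; both class bodies are toy stand-ins for the trained networks, and the "decoder" here simply concatenates frames rather than producing real audio.

```python
import numpy as np

class TargetEncoder:
    """Toy stand-in: maps each phoneme to a fixed random embedding."""
    def __init__(self, dim=8):
        self.dim, self.table = dim, {}
    def encode(self, phoneme):
        if phoneme not in self.table:
            self.table[phoneme] = np.random.rand(self.dim)
        return self.table[phoneme]

class TargetDecoder:
    """Toy stand-in: a real decoder would map acoustic frames to audio."""
    def synthesize(self, frames):
        return np.concatenate(list(frames))

encoder, decoder = TargetEncoder(), TargetDecoder()
code = encoder.encode("a")
print(code.shape, decoder.synthesize([code, code]).shape)  # (8,) (16,)
```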
In one embodiment, processor 501 is further configured to perform the following operations:
acquiring a training sample through a communication interface 503, wherein the training sample comprises training text information and training voice information corresponding to the training text information;
performing text analysis on the training text information through a speech synthesis model to obtain a phoneme sequence of the training text information;
performing phoneme mapping coding on each second phoneme in the phoneme sequence of the training text information to obtain coding information of each second phoneme;
performing sequence extension on the coding information of each second phoneme according to the phoneme duration of each second phoneme to obtain the extended coding information of each second phoneme;
predicting the acoustic features of the current frame according to the extended coding information of each second phoneme and the acoustic features of the previous frame in the speech to be synthesized corresponding to the training text information until the acoustic features of all frames in the speech to be synthesized corresponding to the training text information are obtained;
synthesizing the predicted voice information corresponding to the training text information according to the acquired acoustic features of all frames in the voice to be synthesized corresponding to the training text information;
and training the speech synthesis model according to the predicted voice information corresponding to the training text information and the training voice information to obtain the trained speech synthesis model.
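A schematic training step under common assumptions might look as follows; `model` is a hypothetical module bundling the encoder, duration model, and decoder, the targets are assumed to be framewise acoustic features, and the L1 loss is an assumed choice rather than one stated in the disclosure.

```python
import torch

def train_step(model, optimizer, text, target_frames):
    # Forward pass: text -> predicted acoustic frames of the speech to be
    # synthesized corresponding to the training text information.
    pred_frames = model(text)
    # Compare predicted speech with training speech (here via framewise L1).
    loss = torch.nn.functional.l1_loss(pred_frames, target_frames)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```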
In the embodiment of the application, after the phoneme sequence of the target text information is obtained, phoneme mapping coding is performed on each first phoneme in the phoneme sequence to obtain the coding information of each first phoneme, and sequence extension is performed on the coding information of each first phoneme according to the phoneme duration of each first phoneme to obtain the extended coding information of each first phoneme. The acoustic features of the current frame are then predicted according to the extended coding information of each first phoneme and the acoustic features of the previous frame in the speech to be synthesized corresponding to the target text information, until the acoustic features of all frames in the speech to be synthesized are obtained. Because the coding information of each phoneme is extended in proportion to its duration, the phoneme sequence and the acoustic features are aligned in time, which improves the alignment accuracy. Furthermore, the predicted speech information corresponding to the target text information is synthesized according to the acquired acoustic features of all frames, which improves the reliability of speech synthesis.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like. The computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
A blockchain is a novel application of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, where each data block contains the information of a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
While the invention has been described with reference to a number of embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A method of speech synthesis, comprising:
inputting target text information into a trained voice synthesis model, and performing text analysis on the target text information through the trained voice synthesis model to obtain a phoneme sequence of the target text information;
performing phoneme mapping coding on each first phoneme in the phoneme sequence of the target text information to obtain coding information of each first phoneme;
performing sequence extension on the coding information of each first phoneme according to the phoneme duration of each first phoneme to obtain the extended coding information of each first phoneme;
predicting the acoustic features of the current frame according to the extended coding information of each first phoneme and the acoustic features of the previous frame in the speech to be synthesized corresponding to the target text information until the acoustic features of all frames in the speech to be synthesized corresponding to the target text information are obtained;
and synthesizing the predicted voice information corresponding to the target text information according to the acquired acoustic features of all frames in the voice to be synthesized corresponding to the target text information.
2. The method of claim 1, wherein the performing sequence extension on the coding information of each first phoneme according to the phoneme duration of each first phoneme to obtain the extended coding information of each first phoneme comprises:
acquiring the sampling rate of the voice to be synthesized corresponding to the target text information;
and performing sequence extension on the coding information of each first phoneme according to the phoneme duration and the sampling rate of each first phoneme to obtain the extended coding information of each first phoneme.
3. The method of claim 2, wherein the performing sequence extension on the coding information of each first phoneme according to the phoneme duration and the sampling rate of each first phoneme to obtain the extended coding information of each first phoneme comprises:
multiplying the sampling rate by the phoneme duration of each first phoneme to obtain an expansion factor of each first phoneme;
and repeating the coding information of each first phoneme a number of times equal to its expansion factor to form the extended coding information of each first phoneme.
4. The method of claim 1, wherein the trained speech synthesis model comprises a phoneme duration prediction model;
before the performing sequence extension on the coding information of each first phoneme according to the phoneme duration of each first phoneme to obtain the extended coding information of each first phoneme, the method further includes:
inputting the phoneme sequence of the target text information into the phoneme duration prediction model, and performing phoneme duration prediction on each first phoneme in the phoneme sequence of the target text information through the phoneme duration prediction model to obtain the phoneme duration of each first phoneme.
5. The method according to claim 1, wherein the predicting the acoustic feature of the current frame according to the extended coding information of each first phoneme and the acoustic feature of the previous frame in the speech to be synthesized corresponding to the target text information until obtaining the acoustic features of all frames in the speech to be synthesized corresponding to the target text information comprises:
predicting the acoustic characteristics of the current frame according to the extended coding information of each first phoneme and the acoustic characteristics of the previous frame in the speech to be synthesized corresponding to the target text information;
if a termination identifier for indicating the end of the target text information is obtained, determining to obtain the acoustic features of all frames in the speech to be synthesized corresponding to the target text information;
if the termination identifier for indicating the end of the target text information is not obtained, predicting the acoustic feature of the next frame according to the extended coding information of each first phoneme and the acoustic feature of the current frame in the speech to be synthesized corresponding to the target text information, until the termination identifier for indicating the end of the target text information is obtained.
6. The method of claim 1, wherein the trained speech synthesis model comprises a target encoder and a target decoder;
performing phoneme mapping coding on each first phoneme in the phoneme sequence of the target text information to obtain coding information of each first phoneme, including:
performing phoneme mapping coding on each first phoneme in the phoneme sequence of the target text information through the target coder to obtain coding information of each first phoneme;
the synthesizing of the predicted voice information corresponding to the target text information according to the acquired acoustic features of all frames in the voice to be synthesized corresponding to the target text information comprises:
and performing voice synthesis, through the target decoder, on the acquired acoustic features of all frames in the voice to be synthesized corresponding to the target text information to obtain the predicted voice information corresponding to the target text information.
7. The method of claim 1, wherein the method of training the speech synthesis model comprises:
acquiring a training sample, wherein the training sample comprises training text information and training voice information corresponding to the training text information;
performing text analysis on the training text information through a speech synthesis model to obtain a phoneme sequence of the training text information;
performing phoneme mapping coding on each second phoneme in the phoneme sequence of the training text information to obtain coding information of each second phoneme;
performing sequence extension on the coding information of each second phoneme according to the phoneme duration of each second phoneme to obtain the extended coding information of each second phoneme;
predicting the acoustic features of the current frame according to the extended coding information of each second phoneme and the acoustic features of the previous frame in the speech to be synthesized corresponding to the training text information until the acoustic features of all frames in the speech to be synthesized corresponding to the training text information are obtained;
synthesizing predicted voice information corresponding to the training text information according to the acquired acoustic features of all frames in the voice to be synthesized corresponding to the training text information;
and training the voice synthesis model according to the predicted voice information corresponding to the training text information and the training voice information to obtain the trained voice synthesis model.
8. A speech synthesis apparatus, characterized in that the apparatus comprises:
an input unit, configured to input target text information into a trained speech synthesis model;
a processing unit, configured to perform text analysis on the target text information through the trained speech synthesis model to obtain a phoneme sequence of the target text information;
the processing unit is further configured to perform phoneme mapping coding on each first phoneme in the phoneme sequence of the target text information to obtain coding information of each first phoneme;
the processing unit is further configured to perform sequence extension on the coding information of each first phoneme according to the phoneme duration of each first phoneme to obtain extended coding information of each first phoneme;
the processing unit is further configured to predict the acoustic features of the current frame according to the extended coding information of each first phoneme and the acoustic features of the previous frame in the speech to be synthesized corresponding to the target text information until the acoustic features of all frames in the speech to be synthesized corresponding to the target text information are obtained;
the processing unit is further configured to synthesize predicted speech information corresponding to the target text information according to the obtained acoustic features of all frames in the speech to be synthesized corresponding to the target text information.
9. An electronic device comprising a processor, a memory and a communication interface, the processor, the memory and the communication interface being interconnected, wherein the memory is configured to store computer program instructions and the processor is configured to execute the program instructions to implement the speech synthesis method of any of claims 1-7.
10. A computer-readable storage medium having computer program instructions stored therein, which when executed by a processor, are configured to perform the speech synthesis method of any one of claims 1-7.
CN202111138183.XA 2021-09-27 2021-09-27 Voice synthesis method, device, equipment and medium Pending CN113870838A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111138183.XA CN113870838A (en) 2021-09-27 2021-09-27 Voice synthesis method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN113870838A (en) 2021-12-31

Family

ID=78991549

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111138183.XA Pending CN113870838A (en) 2021-09-27 2021-09-27 Voice synthesis method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN113870838A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination