CN113903326A - Speech synthesis method, apparatus, device and storage medium - Google Patents

Speech synthesis method, apparatus, device and storage medium

Info

Publication number
CN113903326A
CN113903326A
Authority
CN
China
Prior art keywords
phoneme
phoneme sequence
text
sequence
frequency spectrum
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111138381.6A
Other languages
Chinese (zh)
Inventor
倪子凡
王健宗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202111138381.6A priority Critical patent/CN113903326A/en
Publication of CN113903326A publication Critical patent/CN113903326A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a speech synthesis method, apparatus, device and storage medium in the technical field of artificial intelligence speech synthesis, wherein the method comprises the following steps: identifying a phoneme sequence contained in a text, and extracting context information from the phoneme sequence; performing length matching between the phoneme sequence and a preset Mel spectrum according to the context information, and judging from the matching result whether the phoneme sequence needs to be expanded; if so, preprocessing the text, determining alignment information corresponding to the text, and expanding the phoneme sequence based on the alignment information until its length is consistent with that of the preset Mel spectrum, obtaining a target phoneme sequence; and synthesizing the speech corresponding to the text according to the target phoneme sequence. By extending the length of the phoneme sequence according to the context information of the phoneme sequence recognized in the text, the synthesized speech acquires a realistic cadence, with natural rises and falls in intonation, and the speech synthesis effect is improved.

Description

Speech synthesis method, apparatus, device and storage medium
Technical Field
The present application relates to the field of artificial intelligence speech synthesis technology, and in particular, to a speech synthesis method, apparatus, device and storage medium.
Background
Speech is one of the most important tools for human interaction, and speech signal processing has been an important research field for decades. Human speech carries not only symbolic character information but also changes in the speaker's mood and emotion. In modern speech signal processing, analyzing and processing the emotional characteristics of speech signals, and judging and simulating the speaker's joy, anger, sadness and so on, are important research subjects.
Among these, speech synthesis, an important branch of natural language processing technology, has entered a new stage of development as the technology has gradually matured. Speech synthesis is widely used in scenarios such as robots and voice assistants to simulate the effect of a natural person conversing with the user.
However, existing speech synthesis technology simply converts the words of a text into standard machine speech, which differs from the natural speech of a real person and yields a poor synthesis effect.
Disclosure of Invention
The present application provides a speech synthesis method, apparatus, device and storage medium, so as to improve the speech synthesis effect and make the synthesized speech closer to the voice of a real person.
In order to achieve the above object, the present application provides a speech synthesis method, which includes the steps of:
identifying a phoneme sequence contained in a text, and extracting context information from the phoneme sequence;
matching the length of the phoneme sequence with a preset Mel frequency spectrum according to the context information, and judging whether the phoneme sequence needs to be expanded according to a matching result;
if so, preprocessing the text, determining alignment information corresponding to the text, expanding the phoneme sequence based on the alignment information until the length of the phoneme sequence is consistent with that of the preset Mel frequency spectrum, and obtaining a target phoneme sequence; the alignment information represents the alignment relation between the speech to be synthesized and the text;
and synthesizing the voice corresponding to the text according to the target phoneme sequence.
Preferably, the context information includes position information of each phoneme of the phoneme sequence within the phoneme sequence, and matching the length of the phoneme sequence with a preset Mel spectrum according to the context information includes:
determining the pronunciation of each phoneme according to the position information and generating a pronunciation frequency spectrum of each phoneme;
splicing the pronunciation frequency spectrum of each phoneme to generate a frequency spectrum of the phoneme sequence to obtain a target frequency spectrum;
matching the target frequency spectrum with a preset Mel frequency spectrum in length; the preset Mel frequency spectrum acquisition method comprises the following steps:
acquiring a speech segment generated after a professional speaker reads the text aloud, generating a sound spectrum based on the speech segment, and taking the sound spectrum as the preset Mel spectrum.
Preferably, the extending the phoneme sequence based on the alignment information includes:
determining a time interval between two adjacent phonemes in the phoneme sequence based on the alignment information;
copying the phoneme with the front time node in the two phonemes according to the time interval to obtain an extended phoneme corresponding to each phoneme;
and correspondingly adding the extended phoneme corresponding to each phoneme into the phoneme sequence.
Preferably, the synthesizing of the speech corresponding to the text according to the target phoneme sequence includes:
acquiring an amplitude value of a pronunciation frequency spectrum of each phoneme, and taking a part of the pronunciation frequency spectrum, of which the amplitude value is larger than a preset amplitude value, as Gaussian noise of the target phoneme sequence;
and synthesizing the speech corresponding to the text by using the target phoneme sequence without the Gaussian noise.
Preferably, the expanding the phoneme sequence based on the alignment information until the length of the phoneme sequence is consistent with the length of the preset mel frequency spectrum includes:
and adjusting the voice speed of the expanded phoneme sequence, and taking the phoneme sequence with the adjusted voice speed as the target phoneme sequence.
Preferably, the identifying a phoneme sequence contained in the text includes:
performing word segmentation processing on the text to obtain a plurality of words;
determining the sub-phoneme sequence corresponding to each participle;
and combining the sub-phoneme sequences corresponding to all the participles according to a preset combination mode to generate a phoneme sequence of the text.
Preferably, the extracting the context information from the phoneme sequence includes:
decomposing the phoneme sequence into a plurality of phonemes and determining an embedding vector of each phoneme;
carrying out nonlinear transformation on the embedded vector of each phoneme to obtain a nonlinear characteristic corresponding to each phoneme;
determining a context feature corresponding to the nonlinear feature of each phoneme, and splicing the context feature of each phoneme to obtain the context information.
The present application also provides a speech synthesis apparatus, which includes:
the recognition module is used for recognizing a phoneme sequence contained in the text and extracting context information from the phoneme sequence;
the matching module is used for matching the lengths of the phoneme sequence and a preset Mel frequency spectrum according to the context information and judging whether the phoneme sequence needs to be expanded or not according to a matching result;
the extension module is used for preprocessing the text when the phoneme sequence needs to be extended, determining alignment information corresponding to the text, and extending the phoneme sequence based on the alignment information until the length of the phoneme sequence is consistent with the length of the preset Mel frequency spectrum to obtain a target phoneme sequence; the alignment information represents the alignment relation between the speech to be synthesized and the text;
and the synthesis module is used for synthesizing the voice corresponding to the text according to the target phoneme sequence.
The present application further provides a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of any of the above methods when executing the computer program.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of any of the methods described above.
According to the speech synthesis method, apparatus, device and storage medium provided by the present application, a phoneme sequence contained in a text is identified, and context information is extracted from the phoneme sequence; length matching is performed between the phoneme sequence and a preset Mel spectrum according to the context information, and whether the phoneme sequence needs to be expanded is judged from the matching result; if so, the text is preprocessed, alignment information corresponding to the text is determined, and the phoneme sequence is expanded based on the alignment information until its length is consistent with that of the preset Mel spectrum, yielding a target phoneme sequence; and the speech corresponding to the text is synthesized according to the target phoneme sequence. Since the pronunciation of each phoneme in the phoneme sequence is related to the context information, the length of the phoneme sequence recognized in the text is accurately expanded according to its context information until it matches the preset Mel spectrum, so that the synthesized speech acquires a realistic cadence, with natural rises and falls in intonation, and the speech synthesis effect is improved.
Drawings
FIG. 1 is a flowchart illustrating a speech synthesis method according to an embodiment of the present application;
FIG. 2 is a block diagram illustrating a speech synthesis apparatus according to an embodiment of the present application;
fig. 3 is a block diagram illustrating a structure of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The embodiments of the present application may acquire and process related data based on artificial intelligence technology. Artificial Intelligence (AI) refers to the theory, methods, techniques and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. Artificial intelligence software technology mainly comprises computer vision, robotics, biometric recognition, speech processing, natural language processing, and machine learning/deep learning.
In recent years, with the development of deep learning, TTS (Text-To-Speech) models have attracted much attention. Most deep-learning-based speech synthesis systems consist of two parts: first, an acoustic model that converts text into acoustic features such as Mel spectra; second, a vocoder that generates the time-domain speech waveform from those acoustic features.
Modern TTS models can be divided into autoregressive and non-autoregressive models. Autoregressive models can generate high-quality samples by decomposing the output distribution into a product of conditional distributions, but they lack robustness because prediction errors accumulate, and in some cases word-skipping and repetition errors occur. Autoregressive models such as Tacotron2 and Transformer-TTS can produce natural speech, but are limited in that their inference time increases linearly with the Mel-spectrogram length.
Recently, various non-autoregressive TTS models have been proposed to overcome these shortcomings of autoregressive models. Although a non-autoregressive TTS model can synthesize speech stably and its inference is faster than an autoregressive model's, it still has limitations: feed-forward models such as FastSpeech are optimized by simple regression, without any probabilistically modeled objective function, so they cannot produce diverse synthesized speech, and the synthesis effect suffers.
Therefore, to solve the above technical problem, and referring to fig. 1, the present application proposes a speech synthesis method that can be implemented by a TTS system based on a denoising diffusion probabilistic model. The TTS system can be deployed on a server and is composed of a text encoder, a step encoder, a duration predictor and a decoder. The server may be an independent server, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms.
In one embodiment, the speech synthesis method includes the following steps:
s11, identifying a phoneme sequence contained in the text, and extracting context information from the phoneme sequence;
s12, performing length matching on the phoneme sequence and a preset Mel frequency spectrum according to the context information, and judging whether the phoneme sequence needs to be expanded according to a matching result;
s13, if yes, preprocessing the text, determining alignment information corresponding to the text, expanding the phoneme sequence based on the alignment information until the length of the phoneme sequence is consistent with the length of the preset Mel frequency spectrum, and obtaining a target phoneme sequence; the alignment information represents the alignment relation between the speech to be synthesized and the text;
and S14, synthesizing the speech corresponding to the text according to the target phoneme sequence.
As described in step S11, the input text may be preprocessed to remove invalid characters, eliminate ambiguity, and so on, so that the finally synthesized speech has a better effect. Invalid characters include commas, periods, or markup information interspersed in the text, and disambiguation means removing logically erroneous information from the text.
Specifically, the text is the source data for speech synthesis processing, that is, the material from which speech is synthesized. For example, when a user wants a client to output the text by voice broadcast, the text can be sent to a background server (a TTS system based on a denoising diffusion probabilistic model), which synthesizes the text content into speech and returns it to the client for playback.
A phoneme sequence is a structure composed of multiple phonemes. Phonemes are the smallest units of speech, analyzed according to the articulatory actions within a syllable; one action constitutes one phoneme. Whether a text is in English or Chinese, there are corresponding pronunciation rules when it is read aloud. For Chinese, pronunciation is determined by standard Pinyin; for English, by standard phonetic symbols. Both Chinese Pinyin and English phonetic notation are composed of phonemes as the smallest units, such as initials and finals in Chinese Pinyin, and vowels and consonants in English phonetics.
This embodiment may use a text encoder to extract the context information from the phoneme sequence and then provide it to the duration predictor and the decoder. The text encoder consists of 10 dilated-convolution residual blocks and one LSTM layer.
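As an illustration, the following is a minimal PyTorch sketch of such a text encoder. Only the block count (10) and the LSTM layer come from the description above; the vocabulary size, channel width, kernel size and dilation schedule are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One dilated 1-D convolution residual block (hypothetical sizes)."""
    def __init__(self, channels: int, dilation: int):
        super().__init__()
        # padding = dilation keeps the sequence length unchanged for kernel_size=3
        self.conv = nn.Conv1d(channels, channels, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.norm = nn.BatchNorm1d(channels)

    def forward(self, x):
        return x + torch.relu(self.norm(self.conv(x)))

class TextEncoder(nn.Module):
    """Phoneme IDs -> contextual features: 10 dilated residual blocks + one LSTM."""
    def __init__(self, vocab_size: int = 100, channels: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, channels)
        self.blocks = nn.Sequential(
            *[ResidualBlock(channels, dilation=2 ** (i % 5)) for i in range(10)])
        self.lstm = nn.LSTM(channels, channels // 2, batch_first=True,
                            bidirectional=True)

    def forward(self, phoneme_ids):                      # (batch, seq_len)
        x = self.embed(phoneme_ids).transpose(1, 2)      # (batch, channels, seq)
        x = self.blocks(x).transpose(1, 2)               # (batch, seq, channels)
        context, _ = self.lstm(x)                        # (batch, seq, channels)
        return context
```

The returned per-phoneme context features would then be consumed by the duration predictor and the decoder.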
As described in step S12 above, the length of the phoneme sequence may be matched against the Mel spectrum using the length regulator of the duration predictor; the length regulator needs alignment information to extend the phoneme sequence and control the speech speed. Preferably, forced alignment with the Montreal Forced Aligner (MFA), an automatic speech-to-phoneme alignment tool, may be used instead of the commonly used attention-based alignment extractor. MFA provides more robust alignment than attention-based mechanisms, thereby improving alignment accuracy. In addition, the duration predictor uses durations extracted by MFA to stabilize duration prediction. The matching result indicates whether the phoneme sequence is consistent with the length of the preset Mel spectrum: when the lengths are inconsistent, it is judged that the phoneme sequence needs to be expanded; when they are consistent, it is judged that no expansion is needed.
In an embodiment, the context information includes position information of each phoneme of the phoneme sequence in the phoneme sequence, and the length matching the phoneme sequence with a preset mel spectrum according to the context information may specifically include:
determining the pronunciation of each phoneme according to the position information and generating a pronunciation frequency spectrum of each phoneme;
splicing the pronunciation frequency spectrum of each phoneme to generate a frequency spectrum of the phoneme sequence to obtain a target frequency spectrum;
matching the target frequency spectrum with a preset Mel frequency spectrum in length; the preset Mel frequency spectrum acquisition method comprises the following steps:
and acquiring a voice fragment generated after the professional reads the text, generating a sound spectrum based on the voice fragment, and taking the sound spectrum as the preset Mel spectrum.
Specifically, the duration of each phoneme in the phoneme sequence can be determined according to the context information; the duration can be understood as the playing duration of each phoneme, or the playing interval between two adjacent phonemes, during speech playback. The context information includes the position of each phoneme in the phoneme sequence and the phonemes before and after it, and may further include other information such as each phoneme's pronunciation duration. Because the pronunciation of each phoneme is related to its context, and the corresponding pronunciation spectra differ accordingly, the context information of each phoneme must be determined; the pronunciations and pronunciation spectra of all phonemes in the sequence are then determined from that context information, and the target spectrum of the phoneme sequence is generated by splicing the per-phoneme pronunciation spectra in the order in which the phonemes appear in the text. Whether the target spectrum matches the preset Mel spectrum is then judged; if not, it is judged that the phoneme sequence needs to be expanded. The preset Mel spectrum may be a sound spectrum generated from a speech segment recorded after a professional speaker reads the text aloud.
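As a minimal sketch of this matching step, assume each per-phoneme pronunciation spectrum is an (n_mels, frames) NumPy array and that length matching reduces to comparing frame counts; both assumptions go beyond what the description fixes.

```python
import numpy as np

def build_target_spectrum(per_phoneme_specs: list) -> np.ndarray:
    """Splice per-phoneme pronunciation spectra along the time axis,
    in the order the phonemes appear in the text."""
    return np.concatenate(per_phoneme_specs, axis=1)   # (n_mels, total_frames)

def needs_expansion(target_spec: np.ndarray, preset_mel: np.ndarray) -> bool:
    """Expansion is required when the spliced spectrum is shorter than the
    preset (reference) Mel spectrum; the patent treats any length mismatch
    as a failed match, and the shorter case is the one handled by extension."""
    return target_spec.shape[1] < preset_mel.shape[1]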
When it is determined in step S13 that the phoneme sequence does not match the length of the preset Mel spectrum, the phoneme sequence needs to be extended until its length is consistent with that of the preset Mel spectrum, and the extended phoneme sequence is then used as the target phoneme sequence. For example, any two adjacent phonemes in the sequence may be selected and a new phoneme inserted between them to extend the sequence; the new phoneme may be a customized common phoneme, such as the phoneme of the Chinese syllable ā. Preferably, the new phoneme may be one similar to the adjacent phonemes. Alternatively, the time interval between the two adjacent phonemes can be lengthened to extend the phoneme sequence into the target phoneme sequence. The target phoneme sequence is a sequence whose length equals that of the preset Mel spectrum.
In this embodiment, the alignment information includes the positional relationship and the duration information of each phoneme in the text; the phonemes of the synthesized speech are made to correspond to the phonemes of the text according to this positional and duration information, so as to avoid errors caused by inconsistency between the synthesized speech and the text. In addition, a position tag and a duration tag can be appended to each phoneme, and a new phoneme can be added after any position tag whose duration falls below a preset duration, so as to expand the phoneme sequence.
In an embodiment, the alignment information corresponding to the text may be obtained by recognition with a pre-trained recognition model. The training of the recognition model may include: acquiring training data pairs, each comprising paired text data and speech data; encoding the text data and the speech data separately to obtain text features and speech features; inputting the text features into an initial recognition model, training it, and outputting the recognition result corresponding to the text features; generating the alignment information corresponding to each training pair from the text features and speech features; determining the model loss of the initial recognition model from the alignment information and the recognition result; and iteratively training the initial recognition model based on the model loss to obtain a trained recognition model, which is then used to recognize the alignment information corresponding to the text.
It should be noted that the new phoneme may be a customized standard phoneme, which makes it generally applicable.
As described in step S14, this step synthesizes the speech corresponding to the text from the target phoneme sequence. Specifically, the pronunciation duration, pronunciation frequency, pitch and intonation of each phoneme can be obtained and added to the target phoneme sequence, and the phonemes spliced according to their order in the text to obtain the speech corresponding to the text.
When the phonemes are spliced, they are joined sequentially according to their positions in the text, so that the order of the spliced phonemes fully matches their original positions in the text and errors in the synthesized speech are avoided.
The present application provides a speech synthesis method comprising: identifying a phoneme sequence contained in a text, and extracting context information from the phoneme sequence; performing length matching between the phoneme sequence and a preset Mel spectrum according to the context information, and judging from the matching result whether the phoneme sequence needs to be expanded; if so, preprocessing the text, determining alignment information corresponding to the text, and expanding the phoneme sequence based on the alignment information until its length is consistent with that of the preset Mel spectrum, obtaining a target phoneme sequence; and synthesizing the speech corresponding to the text according to the target phoneme sequence. Since the pronunciation of each phoneme in the phoneme sequence is related to the context information, the length of the phoneme sequence recognized in the text is accurately expanded according to its context information until it matches the preset Mel spectrum, so that the synthesized speech acquires a realistic cadence, with natural rises and falls in intonation, and the speech synthesis effect is improved.
In an embodiment, in step S13, the expanding the phoneme sequence based on the alignment information may specifically include:
determining a time interval between two adjacent phonemes in the phoneme sequence based on the alignment information;
copying the phoneme with the front time node in the two phonemes according to the time interval to obtain an extended phoneme corresponding to each phoneme;
and correspondingly adding the extended phoneme corresponding to each phoneme into the phoneme sequence.
In this embodiment, the boundary point between two phonemes in the phoneme sequence may be predicted; once obtained, the boundary points are used to segment the phonemes, locating each phoneme's start and end positions. A phoneme's start position is its boundary with the previous phoneme, and its end position is its boundary with the next phoneme. After the start and end positions are determined, the phoneme's length and the time interval can be determined; the earlier of the two phonemes is then copied based on that length and interval, and the copy is added to the phoneme sequence to expand it and guarantee the phoneme's pronunciation duration.
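The duplication rule described above can be sketched as follows. Representing each aligned phoneme as a (symbol, start, end) triple and converting its time interval into a number of frame-sized copies are illustrative assumptions, not details fixed by the description.

```python
def expand_phonemes(aligned, target_len, frame_sec=0.0125):
    """aligned: list of (symbol, start_sec, end_sec) from the alignment info.
    Repeat each phoneme in proportion to its time interval, then trim or pad
    so the expanded sequence has exactly target_len entries (one per frame
    of the preset Mel spectrum)."""
    expanded = []
    for symbol, start, end in aligned:
        repeats = max(1, round((end - start) / frame_sec))  # interval -> copies
        expanded.extend([symbol] * repeats)
    expanded = expanded[:target_len]                        # trim overshoot
    expanded += [aligned[-1][0]] * (target_len - len(expanded))  # pad shortfall
    return expanded

# usage: expand_phonemes([("n", 0.0, 0.05), ("i", 0.05, 0.20)], target_len=16)
```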
In an embodiment, the synthesizing of the speech corresponding to the text according to the target phoneme sequence may include:
acquiring an amplitude value of a pronunciation frequency spectrum of each phoneme, and taking a part of the pronunciation frequency spectrum, of which the amplitude value is larger than a preset amplitude value, as Gaussian noise of the target phoneme sequence;
and synthesizing the speech corresponding to the text by using the target phoneme sequence without the Gaussian noise.
In this embodiment, the pronunciation spectrum may be displayed as a spectrogram; the amplitude value of each phoneme's pronunciation spectrum is obtained, the amplitudes and the preset amplitude value are marked on the spectrogram, and the part of the pronunciation spectrum whose amplitude exceeds the preset value is treated as the Gaussian noise of the target phoneme sequence. For example, the preset amplitude value may be set to 70 dB, so that any part of the pronunciation spectrum above 70 dB is treated as Gaussian noise to be removed.
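A sketch of this amplitude-threshold rule, assuming the pronunciation spectrum is given in decibels as a NumPy array; clamping the flagged region to the threshold is one simple removal strategy, since the description does not specify how the flagged noise is removed.

```python
import numpy as np

def noise_mask(spectrum_db: np.ndarray, preset_db: float = 70.0) -> np.ndarray:
    """Mark the part of the pronunciation spectrum whose amplitude exceeds
    the preset value as the Gaussian noise of the target phoneme sequence."""
    return spectrum_db > preset_db

def remove_noise(spectrum_db: np.ndarray, preset_db: float = 70.0) -> np.ndarray:
    """Clamp the flagged region to the threshold (an assumed removal strategy)."""
    return np.minimum(spectrum_db, preset_db)
```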
In addition, the phoneme embedding and the diffusion-step embedding of the target phoneme sequence can be extracted; based on these embeddings, the decoder predicts the Gaussian noise in the latent variable at diffusion step t, removes it, and synthesizes the speech corresponding to the text.
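The per-step noise removal can be sketched with the standard DDPM reverse update. Here the decoder is assumed to have already produced the noise estimate eps_pred; the beta schedule and the choice sigma_t^2 = beta_t are conventional assumptions rather than details given in the description.

```python
import torch

def denoise_step(x_t, eps_pred, t, betas):
    """One DDPM reverse step: remove the Gaussian noise predicted by the
    decoder from the latent variable x_t at diffusion step t."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
    mean = (x_t - coef * eps_pred) / torch.sqrt(alphas[t])
    if t == 0:
        return mean                               # final step: no noise added back
    z = torch.randn_like(x_t)
    return mean + torch.sqrt(betas[t]) * z        # sigma_t^2 = beta_t (common choice)
```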
In an embodiment, the expanding the phoneme sequence based on the alignment information until the length of the phoneme sequence is consistent with the length of the preset mel spectrum further includes:
and adjusting the voice speed of the expanded phoneme sequence, and taking the phoneme sequence with the adjusted voice speed as the target phoneme sequence.
This embodiment can adjust the speech speed of the phoneme sequence. During adjustment, the normal speed of human speech can be collected in advance and the speed of the phoneme sequence adjusted to match it; the speed-adjusted phoneme sequence is then used as the target phoneme sequence, so that the synthesized speech plays back naturally and smoothly.
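As a sketch, speed adjustment can be expressed as rescaling each phoneme's duration by the ratio of the sequence's current rate to a reference rate collected from normal human speech; measuring speed in phonemes per second is an assumption for illustration.

```python
def adjust_speed(durations_sec, reference_rate=4.0):
    """durations_sec: per-phoneme durations of the expanded sequence.
    reference_rate: normal speech rate in phonemes per second (assumed),
    collected in advance from human speech. Scaling all durations by
    current_rate / reference_rate brings the sequence to the reference rate."""
    current_rate = len(durations_sec) / sum(durations_sec)
    scale = current_rate / reference_rate   # > 1 means speech is too fast: stretch
    return [d * scale for d in durations_sec]
```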
In an embodiment, in step S11, the recognizing the phoneme sequence included in the text may specifically include:
s111, performing word segmentation processing on the text to obtain a plurality of words;
s112, determining a sub-phoneme sequence corresponding to each participle;
s113, combining the sub-phoneme sequences corresponding to all the participles according to a preset combination mode to generate a phoneme sequence of the text.
In this embodiment, a sub-phoneme sequence is the phoneme sequence corresponding to a part of the input text, and multiple sub-phoneme sequences make up the phoneme sequence of the whole input text. Each word corresponds to a unique phoneme sequence; the phoneme sequence of each word in the input text is taken as one sub-phoneme sequence, and the sub-phoneme sequences of all the words together form the phoneme sequence of the input text.
In another embodiment, the text may instead be grouped, with each group containing at least one character or word; each group corresponds to one sub-phoneme sequence, namely the phoneme sequence of the characters or words the group contains.
If the text is Chinese, word segmentation can be performed based on matching the text against a segmentation dictionary; the sub-phoneme sequence corresponding to each segmented word is then determined, and the phoneme sequence of the text is generated. For example, for the text "你好吗" ("how are you"), segmentation yields "你" (you), "好" (good) and "吗" (an interrogative particle). The three segments are matched against a phoneme dictionary to determine their sub-phoneme sequences, and the three sub-phoneme sequences are finally combined into the phoneme sequence of the text.
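A minimal sketch of steps S111-S113 for such a dictionary-based pipeline. The tiny segmentation and phoneme dictionaries below are hypothetical stand-ins for real lexicons, and simple concatenation is assumed as the preset combination mode.

```python
# Hypothetical miniature lexicons; a real system would load full dictionaries.
SEGMENTATION_DICT = {"你", "好", "吗"}
PHONEME_DICT = {"你": ["n", "i3"], "好": ["h", "ao3"], "吗": ["m", "a5"]}

def segment(text: str) -> list:
    """S111: greedy longest-match segmentation against the dictionary."""
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):        # try the longest candidate first
            if text[i:j] in SEGMENTATION_DICT:
                words.append(text[i:j])
                i = j
                break
        else:
            words.append(text[i])                # keep unknown characters as-is
            i += 1
    return words

def text_to_phonemes(text: str) -> list:
    """S112 + S113: look up each word's sub-phoneme sequence and concatenate."""
    phonemes = []
    for word in segment(text):
        phonemes.extend(PHONEME_DICT.get(word, []))
    return phonemes

print(text_to_phonemes("你好吗"))   # ['n', 'i3', 'h', 'ao3', 'm', 'a5']
```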
In addition, the embodiment of the present application can also recognize the phonemes contained in the text using a trained phoneme recognition model, that is, a neural-network model built to recognize the phonemes contained in text. Specifically, the embodiment of the present application uses a Sequence-to-Sequence (Seq2Seq) model as the phoneme recognition model. Seq2Seq is an end-to-end, encoder-decoder framework: an Encoder converts an arbitrary input sequence of text into a fixed-length vector code, and a Decoder converts that fixed vector into an output sequence, predicting the phonemes in the process.
To give the neural network the ability to recognize phonemes from text, the present application first trains the phoneme recognition model. During training, text samples with labeled phonemes are input into the model for phoneme recognition, and the recognition error is judged by comparing the model's recognized phonemes with the labels. The model's parameters are adjusted according to the phoneme recognition error; recognition is then performed again with the adjusted model, the error recomputed, and the parameters adjusted again. This training process repeats until the recognition error falls below a set threshold, at which point training is complete and the trained model is used to recognize the phonemes contained in the text.
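The train-until-threshold loop described above might look like the following PyTorch sketch; the model, data loader, cross-entropy loss and error threshold are placeholders, since the description fixes only the stopping criterion.

```python
import torch
import torch.nn as nn

def train_phoneme_recognizer(model, loader, max_error=0.05, lr=1e-3, max_epochs=100):
    """Repeat: recognize, compare with labeled phonemes, adjust parameters;
    stop once the recognition error falls below the set threshold."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for epoch in range(max_epochs):
        total_loss, n_batches = 0.0, 0
        for text_ids, phoneme_labels in loader:
            logits = model(text_ids)                      # (batch, seq, n_phonemes)
            # CrossEntropyLoss expects (batch, classes, seq)
            loss = criterion(logits.transpose(1, 2), phoneme_labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
            n_batches += 1
        if total_loss / n_batches < max_error:            # error small enough: stop
            break
    return model
```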
In an embodiment, the extracting the context information from the phoneme sequence includes:
decomposing the phoneme sequence into a plurality of phonemes and determining an embedding vector of each phoneme;
carrying out nonlinear transformation on the embedded vector of each phoneme to obtain a nonlinear characteristic corresponding to each phoneme;
determining a context feature corresponding to the nonlinear feature of each phoneme, and splicing the context feature of each phoneme to obtain the context information.
In this embodiment, the phoneme sequence may be decomposed into multiple phonemes according to positional context, and an embedding vector determined for each phoneme. Specifically, each phoneme in the phoneme set may be assigned a numeric index and represented as a one-hot vector to obtain its embedding vector; alternatively, a neural network model can be built and trained end-to-end from text to phoneme embedding vectors, so that the embeddings are obtained directly from the text.
A nonlinear transformation is then applied to each phoneme's embedding vector to obtain the nonlinear feature corresponding to each phoneme, and the context feature corresponding to each nonlinear feature is determined. Specifically, feature extraction can be performed on each phoneme's embedding vector to obtain that phoneme's context feature.
Finally, the context features of all phonemes are spliced in time order to obtain accurate context information.
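A compact sketch of this extraction path, in which the embedding width, the tanh nonlinearity and a bidirectional GRU as the context extractor are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class ContextExtractor(nn.Module):
    """Phoneme IDs -> embeddings -> nonlinear features -> spliced context."""
    def __init__(self, n_phonemes: int = 100, dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, dim)        # one vector per phoneme
        self.nonlinear = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())
        self.context = nn.GRU(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, phoneme_ids: torch.Tensor) -> torch.Tensor:
        x = self.nonlinear(self.embed(phoneme_ids))       # per-phoneme nonlinear feature
        ctx, _ = self.context(x)                          # context feature per phoneme
        return ctx.reshape(ctx.size(0), -1)               # splice along time

# usage: ContextExtractor()(torch.tensor([[3, 17, 42]]))  -> shape (1, 3 * 128)
```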
Referring to fig. 2, an embodiment of the present application further provides a speech synthesis apparatus, including:
the recognition module 11 is configured to recognize a phoneme sequence included in the text, and extract context information from the phoneme sequence;
a matching module 12, configured to perform length matching on the phoneme sequence and a preset mel spectrum according to the context information, and determine whether the phoneme sequence needs to be extended according to a matching result;
an extension module 13, configured to, when the phoneme sequence needs to be extended, pre-process the text, determine alignment information corresponding to the text, and extend the phoneme sequence based on the alignment information until the length of the phoneme sequence is consistent with the length of the preset mel spectrum, so as to obtain a target phoneme sequence; the alignment information represents the alignment relation between the speech to be synthesized and the text;
and a synthesizing module 14, configured to synthesize a speech corresponding to the text according to the target phoneme sequence.
In the apparatus, the input text can be preprocessed to remove invalid characters, eliminate ambiguity, and so on, so that the finally synthesized speech has a better effect. Invalid characters include commas, periods, or markup information interspersed in the text, and disambiguation means removing logically erroneous information from the text.
Specifically, the text is the source data for speech synthesis processing, that is, the material from which speech is synthesized. For example, when a user wants a client to output the text by voice broadcast, the text can be sent to a background server (a TTS system based on a denoising diffusion probabilistic model), which synthesizes the text content into speech and returns it to the client for playback.
A phoneme sequence is a structure composed of multiple phonemes. Phonemes are the smallest units of speech, analyzed according to the articulatory actions within a syllable; one action constitutes one phoneme. Whether a text is in English or Chinese, there are corresponding pronunciation rules when it is read aloud. For Chinese, pronunciation is determined by standard Pinyin; for English, by standard phonetic symbols. Both Chinese Pinyin and English phonetic notation are composed of phonemes as the smallest units, such as initials and finals in Chinese Pinyin, and vowels and consonants in English phonetics.
This embodiment may use a text encoder to extract the context information from the phoneme sequence and then provide it to the duration predictor and the decoder. The text encoder consists of 10 dilated-convolution residual blocks and one LSTM layer.
The apparatus may also match the length of the phoneme sequence against the Mel spectrum using the length regulator of the duration predictor; the length regulator needs alignment information to extend the phoneme sequence and control the speech speed. Preferably, forced alignment with the Montreal Forced Aligner (MFA), an automatic speech-to-phoneme alignment tool, may be used instead of the commonly used attention-based alignment extractor. MFA provides more robust alignment than attention-based mechanisms, thereby improving alignment accuracy. In addition, the duration predictor uses durations extracted by MFA to stabilize duration prediction. The matching result indicates whether the phoneme sequence is consistent with the length of the preset Mel spectrum: when the lengths are inconsistent, it is judged that the phoneme sequence needs to be expanded; when they are consistent, it is judged that no expansion is needed.
In an embodiment, the context information includes position information of each phoneme of the phoneme sequence in the phoneme sequence, and the matching module 12 may be specifically configured to:
determining the pronunciation of each phoneme according to the position information and generating a pronunciation frequency spectrum of each phoneme;
splicing the pronunciation frequency spectrum of each phoneme to generate a frequency spectrum of the phoneme sequence to obtain a target frequency spectrum;
matching the target frequency spectrum with a preset Mel frequency spectrum in length; the preset Mel frequency spectrum acquisition method comprises the following steps:
acquiring a speech segment generated after a professional speaker reads the text aloud, generating a sound spectrum based on the speech segment, and taking the sound spectrum as the preset Mel spectrum.
Specifically, the duration of each phoneme in the phoneme sequence can be determined according to the context information; the duration can be understood as the playing duration of each phoneme, or the playing interval between two adjacent phonemes, during speech playback. The context information includes the position of each phoneme in the phoneme sequence and the phonemes before and after it, and may further include other information such as each phoneme's pronunciation duration. Because the pronunciation of each phoneme is related to its context, and the corresponding pronunciation spectra differ accordingly, the context information of each phoneme must be determined; the pronunciations and pronunciation spectra of all phonemes in the sequence are then determined from that context information, and the target spectrum of the phoneme sequence is generated by splicing the per-phoneme pronunciation spectra in the order in which the phonemes appear in the text. Whether the target spectrum matches the preset Mel spectrum is then judged; if not, it is judged that the phoneme sequence needs to be expanded. The preset Mel spectrum may be a sound spectrum generated from a speech segment recorded after a professional speaker reads the text aloud.
When it is determined that the phoneme sequence does not match the length of the preset Mel spectrum, the phoneme sequence needs to be extended until its length is consistent with that of the preset Mel spectrum, and the extended phoneme sequence is then used as the target phoneme sequence. For example, any two adjacent phonemes in the sequence may be selected and a new phoneme inserted between them to extend the sequence; the new phoneme may be a customized common phoneme, such as the phoneme of the Chinese syllable ā. Preferably, the new phoneme may be one similar to the adjacent phonemes. Alternatively, the time interval between the two adjacent phonemes can be lengthened to extend the phoneme sequence into the target phoneme sequence. The target phoneme sequence is a sequence whose length equals that of the preset Mel spectrum.
In this embodiment, the alignment information includes the positional relationship and the duration information of each phoneme in the text; the phonemes of the synthesized speech are made to correspond to the phonemes of the text according to this positional and duration information, so as to avoid errors caused by inconsistency between the synthesized speech and the text. In addition, a position tag and a duration tag can be appended to each phoneme, and a new phoneme can be added after any position tag whose duration falls below a preset duration, so as to expand the phoneme sequence.
In an embodiment, the alignment information corresponding to the text may be obtained by recognition with a pre-trained recognition model. The training of the recognition model may include: acquiring training data pairs, each comprising paired text data and speech data; encoding the text data and the speech data separately to obtain text features and speech features; inputting the text features into an initial recognition model, training it, and outputting the recognition result corresponding to the text features; generating the alignment information corresponding to each training pair from the text features and speech features; determining the model loss of the initial recognition model from the alignment information and the recognition result; and iteratively training the initial recognition model based on the model loss to obtain a trained recognition model, which is then used to recognize the alignment information corresponding to the text.
It should be noted that the new phoneme may be a customized standard phoneme, which makes it generally applicable.
Finally, the speech corresponding to the text is synthesized from the target phoneme sequence. Specifically, the pronunciation duration, pronunciation frequency, pitch and intonation of each phoneme can be obtained and added to the target phoneme sequence, and the phonemes spliced according to their order in the text to obtain the speech corresponding to the text.
When the phonemes are spliced, they are joined sequentially according to their positions in the text, so that the order of the spliced phonemes fully matches their original positions in the text and errors in the synthesized speech are avoided.
As described above, it can be understood that each component of the speech synthesis apparatus proposed in the present application can implement the function of any one of the speech synthesis methods described above, and the detailed structure is not described again.
Referring to fig. 3, an embodiment of the present application further provides a computer device whose internal structure may be as shown in fig. 3. The computer device includes a processor, a memory, a network interface and a database connected by a system bus, and the processor of the computer device provides computation and control capabilities.
The memory of the computer device comprises a storage medium and an internal memory. The storage medium stores an operating system, a computer program and a database; the internal memory provides an environment for running the operating system and the computer program stored on the storage medium. The database of the computer device is used to store data such as text and speech.
The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a speech synthesis method.
The processor executes the speech synthesis method, and the method comprises the following steps:
identifying a phoneme sequence contained in a text, and extracting context information from the phoneme sequence;
matching the length of the phoneme sequence with a preset Mel frequency spectrum according to the context information, and judging whether the phoneme sequence needs to be expanded according to a matching result;
if so, preprocessing the text, determining alignment information corresponding to the text, expanding the phoneme sequence based on the alignment information until the length of the phoneme sequence is consistent with that of the preset Mel frequency spectrum, and obtaining a target phoneme sequence; the alignment information represents the alignment relation between the speech to be synthesized and the text;
and synthesizing the voice corresponding to the text according to the target phoneme sequence.
An embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, the computer program, when executed by a processor, implementing a speech synthesis method, including the steps of:
identifying a phoneme sequence contained in a text, and extracting context information from the phoneme sequence;
matching the length of the phoneme sequence with a preset Mel frequency spectrum according to the context information, and judging whether the phoneme sequence needs to be expanded according to a matching result;
if so, preprocessing the text, determining alignment information corresponding to the text, expanding the phoneme sequence based on the alignment information until the length of the phoneme sequence is consistent with that of the preset Mel frequency spectrum, and obtaining a target phoneme sequence; the alignment information represents the alignment relation between the speech to be synthesized and the text;
and synthesizing the voice corresponding to the text according to the target phoneme sequence.
It will be understood by those skilled in the art that all or part of the processes of the methods in the above embodiments can be implemented by a computer program, which can be stored in a computer-readable storage medium, and which, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, a database or another medium provided herein and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
In summary, the most beneficial effects of the present application are as follows:
according to the speech synthesis method, the speech synthesis device, the speech synthesis equipment and the speech synthesis storage medium, a phoneme sequence contained in a text is identified, and context information is extracted from the phoneme sequence; performing length matching on the phoneme sequence and a preset Mel frequency spectrum according to the context information, and judging whether the phoneme sequence needs to be expanded according to a matching result; if so, preprocessing the text, determining alignment information corresponding to the text, and expanding a phoneme sequence based on the alignment information until the length of the phoneme sequence is consistent with the length of the preset Mel frequency spectrum to obtain a target phoneme sequence; and synthesizing the voice corresponding to the text according to the target phoneme sequence, wherein the pronunciation of each phoneme in the phoneme sequence is related to the context information, so that the length of the phoneme sequence is accurately expanded according to the context information of the phoneme sequence in the recognized text and is consistent with the length of the target phoneme sequence, the synthesized voice has the sense of reality of resisting the yangtong frustration, and the voice synthesis effect is improved.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to it. Without further limitation, an element preceded by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, apparatus, article or method that includes that element.
The above description covers only preferred embodiments of the present application and is not intended to limit its scope. Any equivalent structural or process transformation made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of patent protection of the present application.

Claims (10)

1. A method of speech synthesis, comprising:
identifying a phoneme sequence contained in a text, and extracting context information from the phoneme sequence;
matching the length of the phoneme sequence with a preset Mel frequency spectrum according to the context information, and judging whether the phoneme sequence needs to be expanded according to a matching result;
if so, preprocessing the text, determining alignment information corresponding to the text, expanding the phoneme sequence based on the alignment information until the length of the phoneme sequence is consistent with that of the preset Mel frequency spectrum, and obtaining a target phoneme sequence; the alignment information represents the alignment relation between the speech to be synthesized and the text;
and synthesizing the voice corresponding to the text according to the target phoneme sequence.
2. The method of claim 1, wherein the context information comprises position information of each phoneme of the phoneme sequence within the phoneme sequence, and wherein matching the length of the phoneme sequence with a preset Mel frequency spectrum according to the context information comprises:
determining the pronunciation of each phoneme according to the position information and generating a pronunciation frequency spectrum of each phoneme;
splicing the pronunciation frequency spectrum of each phoneme to generate a frequency spectrum of the phoneme sequence to obtain a target frequency spectrum;
matching the target frequency spectrum with a preset Mel frequency spectrum in length; the preset Mel frequency spectrum acquisition method comprises the following steps:
acquiring a speech segment generated after a professional speaker reads the text aloud, generating a sound spectrum based on the speech segment, and taking the sound spectrum as the preset Mel spectrum.
3. The method of claim 2, wherein the synthesizing of the speech corresponding to the text from the target phoneme sequence comprises:
acquiring an amplitude value of the pronunciation frequency spectrum of each phoneme, and taking the part of the pronunciation frequency spectrum whose amplitude value is larger than a preset amplitude value as Gaussian noise of the target phoneme sequence;
and synthesizing the speech corresponding to the text by using the target phoneme sequence with the Gaussian noise removed.
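One plausible reading of claim 3 is a simple amplitude threshold, sketched below in Python; the zeroing operation and the threshold value are assumptions, since the claim does not specify how the Gaussian-noise part is removed.

    import numpy as np

    def remove_gaussian_noise(spectrum: np.ndarray, preset_amplitude: float) -> np.ndarray:
        # Per claim 3, spectrum bins whose amplitude exceeds the preset amplitude
        # value are treated as Gaussian noise; here they are simply zeroed out.
        cleaned = spectrum.copy()
        cleaned[np.abs(cleaned) > preset_amplitude] = 0.0
        return cleaned

    spec = np.array([[0.2, 1.5, 0.4],
                     [2.0, 0.1, 0.9]])
    print(remove_gaussian_noise(spec, preset_amplitude=1.0))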
4. The method of claim 1, wherein the expanding of the phoneme sequence based on the alignment information comprises:
determining a time interval between two adjacent phonemes in the phoneme sequence based on the alignment information;
copying, according to the time interval, the one of the two adjacent phonemes whose time node is earlier, to obtain an extended phoneme corresponding to each phoneme;
and correspondingly adding the extended phoneme corresponding to each phoneme into the phoneme sequence.
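A minimal sketch of the copying step in claim 4, assuming the alignment information is reduced to per-phoneme start times and that one copy is emitted per (hypothetical) 12.5 ms frame:

    FRAME_SECONDS = 0.0125  # assumed frame hop; the claim does not fix one

    def expand_by_alignment(phonemes, start_times):
        # start_times[i] is the time node of phonemes[i] taken from the alignment info.
        expanded = []
        for i in range(len(phonemes) - 1):
            interval = start_times[i + 1] - start_times[i]
            # Copy the earlier of the two adjacent phonemes enough times
            # to cover the time interval between them.
            copies = max(1, round(interval / FRAME_SECONDS))
            expanded.extend([phonemes[i]] * copies)
        expanded.append(phonemes[-1])
        return expanded

    print(expand_by_alignment(["n", "i", "h", "ao"], [0.0, 0.05, 0.10, 0.20]))
    # -> 4 copies of 'n', 4 of 'i', 8 of 'h', then 'ao'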
5. The method of claim 1, wherein the expanding of the phoneme sequence based on the alignment information until the length of the phoneme sequence is consistent with the length of the preset Mel frequency spectrum further comprises:
and adjusting the speech speed of the expanded phoneme sequence, and taking the speed-adjusted phoneme sequence as the target phoneme sequence.
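Read as a uniform time-scaling, the speed adjustment of claim 5 might look like the following one-liner; the rate convention (rate > 1 meaning faster speech) is an assumption:

    def adjust_speech_speed(phoneme_durations, rate):
        # rate > 1.0 shortens every phoneme (faster speech); rate < 1.0 lengthens it.
        return [d / rate for d in phoneme_durations]

    print(adjust_speech_speed([0.05, 0.05, 0.10], rate=1.25))  # roughly [0.04, 0.04, 0.08]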
6. The method of claim 1, wherein the identifying of the phoneme sequence contained in the text comprises:
performing word segmentation processing on the text to obtain a plurality of word segments;
determining a sub-phoneme sequence corresponding to each word segment;
and combining the sub-phoneme sequences corresponding to all the word segments according to a preset combination mode to generate the phoneme sequence of the text.
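A toy Python version of claim 6, assuming a whitespace word-segmenter and a small hand-written lexicon (both stand-ins: Chinese text would need a real segmenter and pronunciation dictionary):

    def text_to_phoneme_sequence(text, lexicon):
        # The lexicon maps each word segment to its sub-phoneme sequence.
        sequence = []
        for word in text.split():  # stand-in for real word segmentation
            # "Combine in order" as the preset combination mode; unknown words
            # fall back to being spelled out letter by letter.
            sequence.extend(lexicon.get(word, list(word)))
        return sequence

    lexicon = {"ni": ["n", "i3"], "hao": ["h", "ao3"]}
    print(text_to_phoneme_sequence("ni hao", lexicon))  # ['n', 'i3', 'h', 'ao3']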
7. The method of claim 1, wherein the extracting of the context information from the phoneme sequence comprises:
decomposing the phoneme sequence into a plurality of phonemes and determining an embedding vector of each phoneme;
carrying out nonlinear transformation on the embedding vector of each phoneme to obtain a nonlinear feature corresponding to each phoneme;
determining a context feature corresponding to the nonlinear feature of each phoneme, and splicing the context feature of each phoneme to obtain the context information.
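Claim 7's pipeline (embedding, nonlinear transformation, per-phoneme context feature, splicing) can be mimicked with the sketch below; the random embeddings, the tanh nonlinearity and the one-neighbour context window are all assumptions standing in for learned model components:

    import numpy as np

    def extract_context_information(phonemes, dim=8):
        rng = np.random.default_rng(0)
        # Embedding vector for each distinct phoneme (randomly initialised here).
        table = {p: rng.standard_normal(dim) for p in sorted(set(phonemes))}
        # Nonlinear transformation of each embedding vector (tanh as a stand-in).
        features = [np.tanh(table[p]) for p in phonemes]
        zero = np.zeros(dim)
        context = []
        for i, feat in enumerate(features):
            left = features[i - 1] if i > 0 else zero
            right = features[i + 1] if i + 1 < len(features) else zero
            # Context feature of one phoneme: its neighbours spliced around it.
            context.append(np.concatenate([left, feat, right]))
        return np.stack(context)  # spliced context information, shape (n, 3 * dim)

    print(extract_context_information(["n", "i", "h", "ao"]).shape)  # (4, 24)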
8. A speech synthesis apparatus, comprising:
the recognition module is used for recognizing a phoneme sequence contained in the text and extracting context information from the phoneme sequence;
the matching module is used for matching the lengths of the phoneme sequence and a preset Mel frequency spectrum according to the context information and judging whether the phoneme sequence needs to be expanded or not according to a matching result;
the extension module is used for preprocessing the text when the phoneme sequence needs to be extended, determining alignment information corresponding to the text, and extending the phoneme sequence based on the alignment information until the length of the phoneme sequence is consistent with the length of the preset Mel frequency spectrum to obtain a target phoneme sequence; the alignment information represents the alignment relation between the speech to be synthesized and the text;
and the synthesis module is used for synthesizing the voice corresponding to the text according to the target phoneme sequence.
9. A computer device, comprising:
a processor;
a memory;
a computer program stored in the memory and configured to be executed by the processor, wherein the computer program is configured to perform the speech synthesis method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the speech synthesis method according to any one of claims 1 to 7.
CN202111138381.6A 2021-09-27 2021-09-27 Speech synthesis method, apparatus, device and storage medium Pending CN113903326A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111138381.6A CN113903326A (en) 2021-09-27 2021-09-27 Speech synthesis method, apparatus, device and storage medium

Publications (1)

Publication Number Publication Date
CN113903326A true CN113903326A (en) 2022-01-07

Family

ID=79029690

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111138381.6A Pending CN113903326A (en) 2021-09-27 2021-09-27 Speech synthesis method, apparatus, device and storage medium

Country Status (1)

Country Link
CN (1) CN113903326A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111429877A (en) * 2020-03-03 2020-07-17 云知声智能科技股份有限公司 Song processing method and device
CN112002305A (en) * 2020-07-29 2020-11-27 北京大米科技有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN111968618A (en) * 2020-08-27 2020-11-20 腾讯科技(深圳)有限公司 Speech synthesis method and device
CN112542153A (en) * 2020-12-02 2021-03-23 北京沃东天骏信息技术有限公司 Duration prediction model training method and device, and speech synthesis method and device
CN112786008A (en) * 2021-01-20 2021-05-11 北京有竹居网络技术有限公司 Speech synthesis method, device, readable medium and electronic equipment
CN113178188A (en) * 2021-04-26 2021-07-27 平安科技(深圳)有限公司 Speech synthesis method, apparatus, device and storage medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230110905A1 (en) * 2021-10-07 2023-04-13 Nvidia Corporation Unsupervised alignment for text to speech synthesis using neural networks
US20230113950A1 (en) * 2021-10-07 2023-04-13 Nvidia Corporation Unsupervised alignment for text to speech synthesis using neural networks
US11769481B2 (en) * 2021-10-07 2023-09-26 Nvidia Corporation Unsupervised alignment for text to speech synthesis using neural networks
US11869483B2 (en) * 2021-10-07 2024-01-09 Nvidia Corporation Unsupervised alignment for text to speech synthesis using neural networks
CN114613353A (en) * 2022-03-25 2022-06-10 马上消费金融股份有限公司 Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN114613353B (en) * 2022-03-25 2023-08-08 马上消费金融股份有限公司 Speech synthesis method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination