CN117612512A - Training method of voice model, voice generating method, equipment and storage medium

Info

Publication number: CN117612512A
Application number: CN202311531840.6A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: text, voice, information, word, model
Inventor: Xu Dong (徐东)
Current and original assignee: Tencent Music Entertainment Technology (Shenzhen) Co., Ltd.
Legal status: Pending
Application filed by Tencent Music Entertainment Technology (Shenzhen) Co., Ltd.
Priority to CN202311531840.6A
Publication of CN117612512A

Classifications

    • G10L13/027 - Concept to speech synthesisers; generation of natural phrases from machine-based concepts
    • G10L13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/10 - Prosody rules derived from text; stress or intonation
    • G10L15/063 - Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/18 - Speech classification or search using natural language modelling
    • G06F40/253 - Grammatical analysis; style critique
    • G06F40/30 - Semantic analysis


Abstract

The present application relates to a training method for a voice model, a voice generating method, a computer device, and a storage medium. Text structure information and phoneme information of text sample data are input into a voice model; the voice model encodes the text structure information and the phoneme information separately, obtains predicted voice features based on the text coding data produced by the encoding, and adjusts model parameters according to the similarity between the predicted voice features and reference voice features until a preset condition is met, yielding a trained voice model. The trained voice model then outputs the corresponding predicted voice based on the text structure information and phoneme information of a target text input by a user. Compared with conventional voice synthesis, generating voice by combining text structure information with phoneme information makes pauses in the generated voice more accurate and natural, improving the naturalness of voice generation.

Description

Training method of voice model, voice generating method, equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technology, and in particular, to a training method for a speech model, a speech generating method, a computer device, a storage medium, and a computer program product.
Background
With the rapid development of deep learning and the rapid growth of hardware computing power, voice that simulates the effect of a real person speaking can now be generated through technologies such as voice synthesis. However, compared with the pause prosody of a real person speaking, currently generated voice still sounds unnatural.
Current voice generation methods therefore suffer from the defect that the generated voice is unnatural.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a training method for a voice model, a voice generating method, a computer device, a computer-readable storage medium, and a computer program product capable of improving the naturalness of generated voice.
In a first aspect, the present application provides a method for training a speech model, the method comprising:
acquiring text sample data and voice characteristics of target voice corresponding to the text sample data;
acquiring text structure information and phoneme information of the text sample data, inputting the text structure information and the phoneme information into a voice model to be trained, obtaining text coding data according to the text structure information and the phoneme information by a text coder of the voice model, and obtaining predicted voice characteristics according to the text coding data by a decoder of the voice model;
obtaining voice characteristic coding data of the voice characteristic through a voice coder, inputting the voice characteristic coding data into the voice model, and obtaining, by the decoder, a reference voice characteristic according to the voice characteristic coding data;
and adjusting model parameters of the voice model to be trained according to the similarity between the predicted voice characteristics and the reference voice characteristics until a preset training ending condition is met, so as to obtain the trained voice model.
In one embodiment, obtaining text structure information of the text sample data includes:
according to the ordering information of each word in the text sample data, obtaining word information of the text sample data;
determining the syntax information of the text sample data according to the word composition information corresponding to the text sample data;
and obtaining text structure information according to the word information and the syntax information.
In one embodiment, the text structure information includes word information and syntax information; the text coding data is obtained according to the text structure information and the phoneme information, and the method comprises the following steps:
obtaining first text coding data according to the phoneme information corresponding to each word in the word information;
determining word attributes of each word in the word information according to the syntax information;
obtaining second text coding data according to the word attributes of the words;
determining start-stop time prediction data of the phoneme information according to the text sample data;
and obtaining the text coding data according to the first text coding data, the second text coding data and the start-stop time prediction data of the phoneme information.
In one embodiment, the method further comprises:
acquiring continuous word and sentence information marked in advance in the text sample data; the pause time between each word in the continuous word and sentence information is smaller than a preset threshold value;
and inputting the continuous word and sentence information into the voice model to be trained, and determining the start-stop time prediction data of each phoneme in the phoneme information according to the continuous word and sentence information by the voice model.
In one embodiment, the obtaining continuous word and sentence information pre-labeled in the text sample data includes:
inputting the text sample data into a trained word segmentation model, and obtaining pre-labeled continuous word and sentence information according to word segmentation results corresponding to the text sample data output by the word segmentation model;
and/or,
obtaining a maximum pause time threshold for each group of adjacent words in the text sample data, and obtaining pre-labeled continuous word and sentence information according to the adjacent words of which the maximum pause time threshold is smaller than the preset threshold;
and/or,
and acquiring preset words and sentences input for the text sample data, and obtaining pre-labeled continuous word and sentence information according to the preset words and sentences.
In a second aspect, the present application provides a speech generation method, the method comprising:
acquiring a target text, and acquiring text structure information and phoneme information of the target text;
inputting the target text into a trained voice model, obtaining text coding data by a text coder of the voice model according to text structure information and phoneme information of the target text, obtaining predicted voice features by a decoder of the voice model according to the text coding data, inputting the predicted voice features into a vocoder in the voice model, and outputting corresponding predicted voice by the vocoder according to the predicted voice features; the voice model is obtained through training according to the method;
and obtaining the voice corresponding to the target text according to the predicted voice.
In one embodiment, the obtaining the target text includes:
acquiring an original text input by a user;
acquiring continuous word and sentence information in the original text;
and obtaining a target text according to the original text carrying the continuous word and sentence information.
In one embodiment, the obtaining continuous word and sentence information in the original text includes:
inputting the original text into a trained word segmentation model, and obtaining continuous word and sentence information according to a word segmentation result corresponding to the original text output by the word segmentation model;
and/or,
obtaining a maximum pause time threshold for each group of adjacent words in the original text, and obtaining continuous word and sentence information according to the adjacent words of which the maximum pause time threshold is smaller than the preset threshold;
and/or,
and acquiring preset words and sentences input for the original text, and obtaining continuous word and sentence information according to the preset words and sentences.
In a third aspect, the present application provides a computer device comprising a memory storing a computer program and a processor implementing the steps of the method described above when the processor executes the computer program.
In a fourth aspect, the present application provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the method described above.
In a fifth aspect, the present application provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the method described above.
With the training method of the voice model, the voice generating method, the computer device, the storage medium, and the computer program product described above, text structure information and phoneme information of text sample data are input into the voice model; the voice model encodes the text structure information and the phoneme information separately, obtains predicted voice features based on the text coding data produced by the encoding, and adjusts model parameters according to the similarity between the predicted voice features and reference voice features until a preset condition is met, yielding a trained voice model. The trained voice model then outputs the corresponding predicted voice based on the text structure information and phoneme information of a target text input by a user. Compared with conventional voice synthesis, generating voice by combining text structure information with phoneme information makes pauses in the generated voice more accurate and natural, improving the naturalness of voice generation.
Drawings
FIG. 1 is an application environment diagram of a training method of a speech model in one embodiment;
FIG. 2 is a flow chart of a method of training a speech model in one embodiment;
FIG. 3 is a flow chart illustrating the steps of pause control in one embodiment;
FIG. 4 is a flow diagram of a method of speech generation in one embodiment;
FIG. 5 is a flow chart of a training method of a speech model according to another embodiment;
FIG. 6 is an internal structure diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
The training method and the voice generating method for the voice model provided by the embodiment of the application can be applied to an application environment shown in fig. 1. Wherein the terminal communicates with the server through a network. The data storage system may store data that the server needs to process. The data storage system may be integrated on a server or may be placed on a cloud or other network server. The server can train the voice model to be trained based on the text sample data and the target voice corresponding to the text sample data to obtain a trained voice model, the terminal can acquire the target text input by the user, the target text is sent to the server, and the server outputs the corresponding predicted voice based on the trained voice model and the target text and sends the predicted voice to the terminal. The terminal may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices, which may be smart watches, smart bracelets, headsets, etc. The server may be implemented as a stand-alone server or as a server cluster composed of a plurality of servers.
In one embodiment, as shown in fig. 2, a training method of a speech model is provided, and the method is applied to the server in fig. 1 for illustration, and includes the following steps:
step S202, acquiring text sample data and voice characteristics of target voices corresponding to the text sample data.
The text sample data can be pre-stored in a database of the server, and can be used for training a voice model to be trained. Wherein the text sample data may be determined based on text entered by the user at a historical time. The speech model may be used to output corresponding speech based on text.
The server may also obtain a target voice corresponding to the text sample data, and obtain a voice feature of the target voice. The target voice may be a voice corresponding to the text sample data and conforming to natural language pauses and speaking habits, and the target voice may be a voice manually generated in advance based on the text sample data, or a voice correctly output corresponding to the text sample data during the generation of the historical voice. For the voice features of the target voice, the server can extract by means of machine learning.
In the training stage, the server can combine the text sample data and the voice characteristics of the target voice corresponding to the text sample data to train the voice model to be trained, so that the voice model constructs an automatic mapping capability, and the mapping relation between the characters and the voices corresponding to the characters is obtained. The mapping relationship includes a word-to-word pause relationship, and the like. The trained voice model can output corresponding voice which is naturally stopped and accords with the natural speaking habit based on the input text.
Step S204, obtaining text structure information and phoneme information of the text sample data, inputting the text structure information and the phoneme information into a voice model to be trained, obtaining text coding data by a text coder of the voice model according to the text structure information and the phoneme information, and obtaining predicted voice characteristics by a decoder of the voice model according to the text coding data.
Wherein, the text sample data can comprise a plurality of types of information. The server may acquire text structure information and phoneme information of the text sample data. Wherein the text structure information indicates each word in the text sample data and information of a relationship between each word. The text structure information may include various information such as word information, syntax information, and the like. Wherein the word information represents each word in the text sample data and the syntax information represents an attribute of the word and sentence, so that the server can determine each word in the text sample data and a relationship between words based on the text structure information. The server may also obtain phoneme information corresponding to the text sample data, such as phonemes and tones for each word in the text sample data.
The server can input the obtained text structure information, phoneme information, and voice feature information into a voice model to be trained, and the voice model can encode the text structure information and the phoneme information separately and obtain predicted voice features based on the resulting text coding data. The voice model may include text encoders and a decoder. Each text encoder may correspond to one type of information input into the voice model; for example, the text structure information corresponds to a text structure information encoder, and the phoneme information corresponds to a phoneme encoder. The voice model encodes the corresponding information through the corresponding text encoder to form text coding data. The above-mentioned voice features may be input into the decoder, so that the voice model generates the corresponding predicted voice features in the decoder using the text coding data with reference to the voice features.
The text structure information and the phoneme information can each be obtained in a corresponding way. For example, in one embodiment, for the text structure information, the server may obtain the word information of the text sample data based on the ordering information of the individual words in the text sample data, and may determine the syntax information of the text sample data according to the word composition information corresponding to the text sample data. The server then obtains the text structure information from the word information and the syntax information. For the phoneme information, the server may extract it from the text sample data with a tool.
Specifically, the phoneme information may include phonemes and tones, and the server may extract it from the text sample data through a tool such as jieba. Phonemes can take various formats, such as international phonetic symbols, or the initials and finals of Chinese pinyin. Tones refer to the Mandarin tones: yin ping (first tone), yang ping (second tone), shang (third tone), qu (fourth tone), and the neutral tone. Each pronunciation in the target voice can be mapped to a corresponding phoneme, and the server can obtain the voice start and stop times corresponding to each phoneme by aligning the phoneme information with the target voice in advance, for example, through automatic alignment or manual labeling.
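To make the phoneme extraction above concrete, the following is a minimal sketch (not part of the patent) of extracting per-word phonemes and tones from Chinese text. The patent names the jieba tool; the use of pypinyin for the grapheme-to-phoneme step is an assumption added for illustration.

    # Hedged sketch: per-word phoneme (initial/final) and tone extraction.
    # jieba is named in the patent; pypinyin is an assumed stand-in for the
    # grapheme-to-phoneme conversion and is not named in the patent.
    import jieba
    from pypinyin import lazy_pinyin, Style

    def extract_phoneme_info(text: str):
        """Return (word, initials, finals_with_tone) tuples for each word."""
        result = []
        for word in jieba.lcut(text):
            initials = lazy_pinyin(word, style=Style.INITIALS, strict=False)
            # TONE3 appends the tone digit (1-4; no digit means neutral tone),
            # mirroring the phoneme-plus-tone decomposition described above.
            finals = lazy_pinyin(word, style=Style.FINALS_TONE3)
            result.append((word, initials, finals))
        return result

    if __name__ == "__main__":
        for entry in extract_phoneme_info("中国历史文化源远流长"):
            print(entry)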
In the text encoder, the server may train the voice model's ability to predict the start and stop times of the voice corresponding to the phoneme information. For example, the voice model predicts the start and stop times of the voice based on the phoneme information, and the predicted start and stop times are compared with the pre-aligned start and stop times to adjust the model parameters used for predicting start and stop times from phonemes. Iterative training continues until, within a preset number of training rounds, the similarity is greater than or equal to a preset start-stop time similarity threshold, or the number of training rounds reaches the preset number, at which point training of the voice model's recognition of phoneme start and stop times is determined to be complete.
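As an illustration of how such a start-stop (duration) predictor might be trained, the sketch below uses a small convolutional predictor and a mean-squared-error loss in PyTorch; the architecture, shapes, and loss are assumptions for illustration, not details fixed by the patent.

    # Hedged sketch of training a per-phoneme duration predictor (PyTorch).
    # Architecture and loss are illustrative assumptions.
    import torch
    import torch.nn as nn

    class DurationPredictor(nn.Module):
        def __init__(self, hidden: int = 256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
                nn.ReLU(),
            )
            self.proj = nn.Linear(hidden, 1)  # one duration per phoneme

        def forward(self, phoneme_enc):            # (batch, time, hidden)
            h = self.net(phoneme_enc.transpose(1, 2)).transpose(1, 2)
            return self.proj(h).squeeze(-1)         # (batch, time)

    model = DurationPredictor()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.MSELoss()

    # Dummy batch: encoded phonemes and ground-truth durations derived from
    # the pre-aligned phoneme/target-voice start and stop times.
    phoneme_enc = torch.randn(8, 40, 256)
    true_durations = torch.rand(8, 40) * 20

    pred = model(phoneme_enc)
    loss = loss_fn(pred, true_durations)  # compare predicted vs aligned times
    loss.backward()
    optimizer.step()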
The word information may include each word of the input text sample data arranged in order. For example, the sentence "Chinese history and culture have a long history, with five thousand years of brilliant civilization." constitutes word information. When training the voice model on text structure information, the text encoder can jointly model the phoneme information corresponding to each word, fusing coarse-grained words with fine-grained phonemes.
The syntax information may include attributes of words and sentences, such as subject and object, noun, verb, and adjective. The server can obtain a syntactic dependency analysis result by performing natural language processing on the input text sample data. For example, the server may extract the syntax information in the text sample data through tools such as HanLP (Han Language Processing package), StanfordNLP (Stanford Natural Language Processing), or LTP (Language Technology Platform). The server can thus combine the higher-dimensional syntax information, word information, phoneme information, and voice features to generate corresponding predicted voice features, and the voice model to be trained is trained based on the predicted voice features.
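As a concrete possibility, word attributes of this kind can be obtained with an off-the-shelf part-of-speech tagger. The patent names HanLP, StanfordNLP, and LTP; the sketch below substitutes jieba's part-of-speech module purely for brevity, which is an assumption rather than the patent's stated tooling.

    # Hedged sketch: extract word attributes (noun, verb, etc.) as a proxy
    # for the syntax information described above. jieba.posseg stands in
    # for the HanLP / StanfordNLP / LTP tools named in the patent.
    import jieba.posseg as pseg

    def extract_word_attributes(text: str):
        """Return (word, part-of-speech flag) pairs, e.g. n=noun, v=verb."""
        return [(pair.word, pair.flag) for pair in pseg.cut(text)]

    print(extract_word_attributes("中国历史文化源远流长"))
    # e.g. [('中国', 'ns'), ('历史', 'n'), ('文化', 'n'), ('源远流长', 'i')]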
Step S206, obtaining the voice characteristic coding data of the voice characteristic through the voice coder, inputting the voice characteristic coding data into the voice model, and obtaining the reference voice characteristic by the decoder according to the voice characteristic coding data.
The predicted voice features and the voice features of the target voice may be spectra, such as mel spectrograms. After the voice model generates the predicted voice features through the decoder, they may be compared with the reference voice features of the target voice. The encoders in the server may include a voice encoder and a text encoder: the voice features of the target voice are encoded through the voice encoder, and the encoded voice features are decoded through the decoder, thereby obtaining the reference voice features.
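For concreteness, a log-mel spectrogram of the kind mentioned above could be extracted as in the sketch below; librosa and the specific frame parameters are assumptions typical of TTS pipelines, not values given in the patent.

    # Hedged sketch: compute a log-mel spectrogram as the voice feature.
    # librosa and all frame parameters are illustrative assumptions.
    import librosa
    import numpy as np

    def voice_features(wav_path: str) -> np.ndarray:
        y, sr = librosa.load(wav_path, sr=22050)
        mel = librosa.feature.melspectrogram(
            y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80
        )
        return librosa.power_to_db(mel)  # (n_mels, frames) log-mel spectrogram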
Step S208, adjusting model parameters of the voice model to be trained according to the similarity between the predicted voice features and the reference voice features until a preset training-end condition is met, to obtain the trained voice model.
The server can adjust the model parameters based on the comparison between the reference voice features and the predicted voice features. For example, the server inputs the predicted voice features and the voice features of the target voice into a preset loss function, and determines the similarity between them by detecting whether the function value of the loss function converges. The server adjusts the model parameters of the voice model to be trained according to the similarity, then returns to the step of inputting the text structure information and the phoneme information into the voice model to be trained for the next round of training. Once a preset training-end condition is met, the server can stop training and obtain the trained voice model. The preset training-end condition may take various forms, for example, the similarity being greater than or equal to a preset similarity threshold within a preset number of training rounds, or the number of training rounds reaching the preset number.
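A minimal sketch of this outer training loop follows; speech_model is a placeholder for the text-encoder/decoder model described above, and the use of an L1 loss over spectrogram frames as the similarity measure is one reasonable reading of the patent, not its stated choice.

    # Hedged sketch of the outer training loop: adjust parameters until the
    # predicted and reference features are similar enough or a step budget
    # is exhausted. speech_model is a placeholder for the full model.
    import torch

    def train(speech_model, data_loader, max_steps=100_000, loss_threshold=0.05):
        optimizer = torch.optim.Adam(speech_model.parameters(), lr=1e-4)
        loss_fn = torch.nn.L1Loss()   # similarity proxy between spectrograms
        step = 0
        for text_struct, phonemes, ref_mel in data_loader:
            pred_mel = speech_model(text_struct, phonemes)
            loss = loss_fn(pred_mel, ref_mel)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            # Preset end conditions: converged (loss below threshold, i.e.
            # similarity above threshold) or step budget reached.
            if loss.item() < loss_threshold or step >= max_steps:
                break
        return speech_model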
In addition, in some embodiments, the voice model further includes a vocoder. After the voice model generates the predicted voice features, it may generate the corresponding predicted voice from the predicted voice features in the vocoder and output it, thereby implementing text-based voice generation. Specifically, by training with the word information, syntax information, and phoneme information as inputs, the server trains not only on the voice segment information corresponding to each phoneme, but also on the model's ability to predict phoneme start and stop times, the corresponding word information, and the relationships between words. The model thus learns both to predict the voice for a text and to predict the pause relationships between words in the voice, so that the generated voice conforms to the way people actually speak.
In the above training method for a voice model, text structure information and phoneme information of text sample data are input into the voice model; the voice model encodes them separately, obtains predicted voice features based on the resulting text coding data, and adjusts model parameters according to the similarity between the predicted voice features and reference voice features until a preset condition is met, yielding a trained voice model. The trained voice model then outputs the corresponding predicted voice based on the text structure information and phoneme information of a target text input by a user. Compared with conventional voice synthesis, generating voice by combining text structure information with phoneme information makes pauses in the generated voice more accurate and natural, improving the naturalness of voice generation.
In one embodiment, obtaining text encoding data from text structure information and phoneme information includes: obtaining first text coding data according to the word information and the phoneme information; obtaining second text coding data according to the syntax information and the word information; and obtaining text coding data according to the first text coding data, the second text coding data and the start-stop time prediction data of the phoneme information.
In this embodiment, the text structure information may include word information and syntax information in the text. The server may encode word information, syntax information, and phoneme information using a text-to-text encoder in the speech model. The text encoder may be of various types, for example, may include a word information text encoder, a syntax text encoder, a phoneme text encoder, etc., and the speech model may encode the corresponding type of information using different text encoders.
The text structure information and the phoneme information may influence each other during encoding. For example, the server obtains the first text coding data from the word information and the phoneme information through the voice model, and obtains the second text coding data from the syntax information and the word information. In addition, the server may determine the start-stop time prediction data of the phoneme information through the voice model: by training the model to match the phoneme information against the target voice, the voice model learns to identify the start and stop times of each phoneme within the target voice and to predict start-stop time prediction data for phoneme information. Thus, for the trained voice model, corresponding start-stop time prediction data can be predicted directly from the input phoneme information.
The first text coding data and the second text coding data may be encoded in different ways. For example, in one embodiment, the server may obtain, through the voice model, the phoneme information corresponding to each word in the word information; that is, the voice model matches the word information with the phoneme information and encodes according to the phoneme information corresponding to each word, obtaining the first text coding data. In one embodiment, the server may determine the word attributes of each word in the word information, such as subject and object, noun, verb, and adjective, based on the syntax information. The voice model encodes according to the word attributes of each word, obtaining the second text coding data.
The first text coding data, the second text coding data, and the start-stop time prediction data of the phoneme information may all be vector data. The server obtains the text coding data from these three through the voice model. Specifically, the voice model aligns the phoneme information with the corresponding voice to obtain the start-stop time prediction data for each phoneme. It jointly models the phoneme information corresponding to each word, fusing coarse-grained words with fine-grained phonemes, to perform the first encoding; and it jointly models the syntax information, word information, and phoneme information to perform the second encoding. The voice model can thereby obtain, based on the text sample data, the voice segment information corresponding to each phoneme, the word information corresponding to each phoneme, and the relationships between words. The server may input the resulting vector data into the decoder through the voice model and, with reference to the voice features of the target voice, generate the spectrum of the corresponding predicted voice features from the first text coding data, the second text coding data, and the start-stop time prediction data. The voice model is trained based on a comparison of the voice features of the target voice with the predicted voice features.
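The fusion of coarse-grained word information with fine-grained phoneme information can be pictured as expanding each word-level vector over its phonemes and concatenating the streams. The sketch below illustrates the idea; the dimensions and the fusion-by-concatenation strategy are assumptions.

    # Hedged sketch: fuse word-level encodings (coarse) with phoneme-level
    # encodings (fine) by repeating each word vector over its phonemes and
    # concatenating. Dimensions and concatenation are assumptions.
    import torch

    def fuse_word_and_phoneme(word_enc, phonemes_per_word, phoneme_enc):
        """
        word_enc:          (num_words, d_word) word/syntax encodings
        phonemes_per_word: list of phoneme counts, one per word
        phoneme_enc:       (num_phonemes, d_phone) phoneme encodings
        """
        expanded = torch.repeat_interleave(
            word_enc, torch.tensor(phonemes_per_word), dim=0
        )                                  # (num_phonemes, d_word)
        return torch.cat([phoneme_enc, expanded], dim=-1)

    word_enc = torch.randn(3, 16)      # e.g. 3 words
    phoneme_enc = torch.randn(7, 32)   # e.g. 7 phonemes in total
    fused = fuse_word_and_phoneme(word_enc, [2, 3, 2], phoneme_enc)
    print(fused.shape)                 # torch.Size([7, 48])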
Through the embodiment, the server encodes various text structure information and phoneme information through the voice model, generates predicted voice characteristics based on text encoding data, trains the voice model, enables the voice generated by the trained voice model to conform to the rule of speaking by a person, and improves the naturalness of voice generation.
In one embodiment, further comprising: acquiring continuous word and sentence information marked in advance in text sample data; the pause time between each word in the continuous word and sentence information is smaller than a preset threshold value; and inputting the continuous word and sentence information into a voice model to be trained, and determining the start-stop time prediction data of each phoneme in the phoneme information according to the continuous word and sentence information by the voice model.
In this embodiment, the text sample data has a predetermined pause relationship, for example, a pause relationship determined by punctuation marks. The text between every two pauses may belong to the continuous word and sentence information, i.e. the pauses between the words in the continuous word and sentence information cannot be too long, e.g. the pause time between the words in the continuous word and sentence information is smaller than a preset threshold value. Since the pause relation required by a word in different contexts is different, there may be a situation that the pause relation is wrong in the text sample data. The server may obtain, in advance, labels of continuous word and sentence information on the text sample data, so as to determine start-stop time prediction data of each phoneme in the phoneme information based on the text sample data.
After the server obtains the continuous word and sentence information pre-labeled in the text sample data, it can input the information into the voice model to be trained, and the voice model determines the start-stop time prediction data of each phoneme in the phoneme information according to the continuous word and sentence information. For example, the voice model determines, based on the continuous word and sentence information, which words in the text sample data require a pause time smaller than the preset threshold, and consequently which phonemes require a start-stop time interval smaller than the preset threshold, so that the voice model generates start-stop time prediction data for each phoneme under this constraint.
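One way to realize this constraint, sketched below under assumed data layouts, is to cap the predicted pause durations at word boundaries that fall inside an annotated continuous phrase.

    # Hedged sketch: cap predicted pause durations at boundaries that lie
    # inside pre-labeled continuous phrases, so each gap stays below the
    # preset threshold. The data layout here is an illustrative assumption.
    def constrain_pauses(pause_durations, continuous_spans, threshold):
        """
        pause_durations:  list of predicted pauses (seconds) after each word
        continuous_spans: list of (start_idx, end_idx) word-index ranges that
                          must be read continuously
        threshold:        preset maximum pause inside a continuous phrase
        """
        constrained = list(pause_durations)
        for start, end in continuous_spans:
            for i in range(start, end):        # boundaries inside the span
                constrained[i] = min(constrained[i], threshold)
        return constrained

    print(constrain_pauses([0.05, 0.40, 0.10, 0.60], [(0, 2)], 0.15))
    # -> [0.05, 0.15, 0.1, 0.6]  (the 0.40 s pause inside the span is capped)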
According to the embodiment, the server can label the input text sample data in advance so that the voice model can combine with the correct pause relation to generate the predicted voice characteristics, further the predicted voice output by the voice model obtained through training accords with the rule of speaking by a person, and the naturalness of voice generation is improved.
In one embodiment, obtaining continuous word and sentence information pre-labeled in text sample data includes: inputting the text sample data into a trained word segmentation model, and obtaining pre-labeled continuous word and sentence information according to word segmentation results corresponding to the text sample data output by the word segmentation model.
In this embodiment, there may be various ways of labeling the continuous word and sentence information in the text sample data in advance. For example, the server may segment the text sample data, the server may input the text sample data into a trained segmentation model, the segmentation model outputs a corresponding segmentation result for the text sample data, and the server obtains continuous word and sentence information labeled in advance based on the segmentation result.
The server may also determine continuous phrase information by controlling the dwell time. For example, in one embodiment, the server may obtain a maximum dwell time threshold for each set of adjacent words in the text sample data, which may be preset by the respective staff member. The server can detect the maximum pause time threshold value between each group of adjacent words, and obtain the pre-labeled continuous word and sentence information according to the adjacent words of which the maximum pause time threshold value is smaller than the preset threshold value.
The server may also determine continuous phrase information by setting a preset phrase. For example, in one embodiment, the server may obtain a preset term for text sample data input, the preset term may be input in advance by a user or a related staff member, and the server may obtain continuous term information labeled in advance according to the preset term.
Specifically, as shown in fig. 3, fig. 3 is a flow chart illustrating the pause control step in one embodiment. After the continuous word and sentence information of the input text is determined in advance, the server may process the labeled text through the voice model. The continuous word and sentence information can be determined in various ways. For example, the server determines it through the word segmentation model to achieve pause control: the server recognizes the input text through the word segmentation model and combines text that may form words or phrases, and the start and stop times of phonemes within those words or phrases must be limited, for example to within the preset threshold, to avoid overlong pauses and pause errors. The server may also perform pause control through a duration threshold, i.e., set a maximum duration for the pause between words, such as the above-mentioned preset threshold; the server can thereby avoid long pauses inside a single sentence. In addition, the server can perform pause control through preset words and sentences: when the text is input, the words or phrases to be read continuously are input at the same time to form continuous word and sentence information, so that the voice model can adjust the corresponding voice pauses based on the labeled continuous word and sentence information and avoid overlong pause times.
The server may determine the continuous word and sentence information using any one of these methods, or a combination of several of them, as the sketch below illustrates.
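The sketch below (data structures assumed for illustration) combines the three sources: a word segmenter, per-pair pause thresholds, and user-supplied phrases.

    # Hedged sketch: derive continuous-phrase annotations from the three
    # sources described above. All data structures are assumptions.
    import jieba

    def continuous_phrases(text, pair_max_pause, preset_threshold, user_phrases):
        """
        text:             raw input text
        pair_max_pause:   {(left_word, right_word): maximum pause in seconds}
        preset_threshold: the preset pause threshold described above
        user_phrases:     phrases the user marked as must-read-continuously
        """
        phrases = set()
        # 1) word-segmentation model: multi-character tokens are continuous
        phrases.update(w for w in jieba.lcut(text) if len(w) > 1)
        # 2) duration threshold: adjacent words whose allowed pause is short
        for (left, right), max_pause in pair_max_pause.items():
            if max_pause < preset_threshold:
                phrases.add(left + right)
        # 3) user-specified preset words and sentences
        phrases.update(user_phrases)
        return phrases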
According to the embodiment, the server can label the continuous word and sentence information in the text sample data in various modes, so that the voice model can combine the continuous word and sentence information to generate the predicted voice conforming to the natural pause relation, and the naturalness of voice generation is improved.
In one embodiment, as shown in fig. 4, a speech generation method is provided, and the method is applied to the server in fig. 1 for illustration, and includes the following steps:
step S302, acquiring a target text, and acquiring text structure information and phoneme information of the target text.
The target text can be a text input by a user in the terminal, the terminal can send the target text input by the user to the server, and after the server acquires the target text, the server can acquire text structure information and phoneme information in the target text. The text structure information may include word information and syntax information, among others. The server can extract the phoneme information in the target text through tools such as jieba, confirm the word information by identifying the orderly arranged words in the target text, and identify the syntax information in the target text through natural language processing technology.
Step S304, inputting the target text into a trained voice model, obtaining text coding data by a text coder of the voice model according to text structure information and phoneme information of the target text, obtaining predicted voice characteristics by a decoder of the voice model according to the text coding data, inputting the predicted voice characteristics into a vocoder in the voice model, and outputting corresponding predicted voice by the vocoder according to the predicted voice characteristics; the voice model is obtained by training according to the voice model training method; and obtaining the voice corresponding to the target text according to the predicted voice.
The server may pre-train a voice model. The server inputs the target text into the voice model, which outputs the corresponding predicted voice based on the text structure information of the target text, namely the word information and the syntax information, combined with the phoneme information. The voice model can be obtained through training according to the voice model training method described above.
The voice model may include a text encoder, a decoder, and a vocoder, and the voice model may encode word information, syntax information, and phoneme information, decode text encoded data by the decoder, thereby generating predicted voice features based on the decoded data, and generate corresponding predicted voices based on the predicted voice features in the vocoder. The server can also return the predicted voice output in the terminal and play the predicted voice in the terminal, so that the user can obtain the synthesized voice corresponding to the input target text.
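Putting the application-stage pieces together, inference might look like the following sketch; every component is a placeholder for the text encoder, decoder, and vocoder described above, not an API defined by the patent. extract_word_attributes and extract_phoneme_info refer to the earlier illustrative sketches.

    # Hedged sketch of the application-stage pipeline: target text in,
    # waveform out. All component objects are placeholders.
    def generate_voice(target_text, text_encoder, decoder, vocoder):
        struct = extract_word_attributes(target_text)  # word/syntax info
        phones = extract_phoneme_info(target_text)     # phoneme info
        text_encoding = text_encoder(struct, phones)   # text coding data
        pred_mel = decoder(text_encoding)              # predicted voice features
        return vocoder(pred_mel)                       # predicted voice waveform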
Specifically, when the user needs voice generation, the user may input the text, such as a sentence or an article, that they want to hear as voice. The terminal can display the text information input by the user; after the user clicks a synthesis button on the terminal, the terminal sends the text to the server. After the server generates the voice, it can return the generated voice to the terminal, and the terminal plays it. While the voice is playing, the text color can be highlighted according to the actual playback progress, so that the user knows in real time which part of the text corresponds to the voice currently being played.
In the above-described speech generating method, by inputting text structure information and phoneme information of the target text into the speech model, the speech model outputs the corresponding predicted speech based on the text structure information and phoneme information in the target text input by the user. Compared with the traditional voice obtained through voice synthesis, the voice generation method and the voice generation device have the advantages that voice generation is carried out by combining text structure information and phoneme information, so that generated voice and language pauses more accurately and more naturally, and the naturalness of voice generation is improved.
In one embodiment, obtaining the target text includes: acquiring an original text input by a user; acquiring continuous word and sentence information in an original text; and obtaining the target text according to the original text carrying the continuous word and sentence information.
In this embodiment, the user may preset the pause relationship between words in the text. The server can acquire the original text input by the user, the user can label the continuous word and sentence information aiming at the original text, so that the server can acquire the continuous word and sentence information in the original text, and the target text is acquired according to the original text carrying the continuous word and sentence information.
The above-mentioned pause relationship can be set in various ways. For example, in one embodiment, the server may input the original text into a trained word segmentation model and obtain continuous word and sentence information from the word segmentation result output by the model. Specifically, the server identifies combinations that may form words or phrases by feeding the input text into the word segmentation model; the start and stop times of phonemes between the words in those combinations need to be limited, for example to within the preset threshold, thereby avoiding pause errors caused by overlong pauses and labeling the continuous word and sentence information.
The server may also receive the user's control over the duration threshold. For example, in one embodiment, the server obtains a maximum pause time threshold for each group of adjacent words in the original text, and obtains continuous word and sentence information based on the adjacent words whose maximum pause time threshold is smaller than the preset threshold. Specifically, the user sets the maximum pause duration between words, so that the server can determine continuous word and sentence information based on adjacent words whose maximum pause duration is smaller than the preset threshold, thereby avoiding noticeably long pauses inside a single sentence.
The server may also receive words and sentences preset by the user. For example, in one embodiment, the server may obtain preset words and sentences entered by the user for the original text, and obtain continuous word and sentence information according to them. Specifically, when inputting text, the user can simultaneously input the words or phrases that should be read continuously to form continuous word and sentence information, so that the voice model can adjust the pauses between words in the continuous word and sentence information and avoid overlong pauses.
When generating the predicted voice based on the target text, two cases may arise in which external control of pauses is required. The first is a pause error that may occur in the model's prediction; the second is a prediction that is correct but contradicts the person's subjective intent. As to the first, because the model predicts based on probability statistics, it may have seen some rarely used words little or not at all during training, resulting in poor modeling of them. As to the second, in specific scenarios it may be desirable to adjust pauses between certain words to match a person's subjective intent. In either case, the user may label the original text to obtain a target text carrying continuous word and sentence information.
In addition, the user can label the original text while inputting it to obtain the target text. In some embodiments, after the server generates the predicted voice, the user may also modify the pause relationships of the target text to obtain a new target text, from which the server generates new predicted voice, thereby updating the predicted voice. Specifically, when the user needs to adjust a pause, the user can tap the corresponding text content on the terminal screen; the terminal displays a popup window in which the user can edit the text, adding the appropriate marks where words should be read continuously or an extra pause should be inserted. The user then clicks a save-and-synthesize button, the terminal sends the inserted continuous-reading or extra-pause information to the server, and the server derives new continuous word and sentence information from it. The server then generates new predicted voice through the voice model using the new continuous word and sentence information and returns it to the terminal, updating the voice content in real time. In this way, the user gets a fully automatic text-listening experience, while manually editing the text to be locally modified achieves a personalized, controllable effect.
Through the embodiment, the server can determine the pause relation in the text based on various modes, the pause relation can be updated by the user in real time, and the server generates corresponding predicted voice through a voice model by combining the pause relation, so that the naturalness of the generated voice is improved.
In one embodiment, as shown in fig. 5, fig. 5 is a flow chart of a training method of a speech model in another embodiment. In this embodiment, a training phase and an application phase are included. In the training stage, text structure information such as word information, syntax information and the like of an input text is obtained, and phoneme information of the input text is obtained. The word information and the syntax information are encoded by a text encoder, and the start-stop time prediction of each phoneme in the phoneme information is performed by the text encoder in combination with the target voice. The start-stop time prediction capability of the phoneme information is trained, for example, by a corresponding loss function.
The server inputs the encoded text coding data into the decoder. In the training phase, the server trains the voice model using the voice features of the target voice: for example, the voice model generates the corresponding predicted voice features from the text coding data in the decoder, so the server can train the voice model by taking the voice features of the target voice as the target and comparing them with the predicted voice features. The model thereby learns not only the pronunciation of each phoneme, but also which word it belongs to and how it is pronounced in context. The voice model combines the word information, syntax information, and phoneme information; after generating the predicted voice features through encoding and decoding, it can also generate the corresponding predicted voice from those features through the vocoder.
In the application stage, after obtaining the phoneme information, word information, and syntax information of the input text, the server can directly encode them through the text encoder and input the text coding data into the decoder. The voice model generates the corresponding predicted voice features in the decoder based on the phonemes corresponding to each word, the start and stop times of each phoneme, and the word attributes of each word, and the vocoder generates the corresponding predicted voice from the predicted voice features. In some embodiments, the server may further obtain continuous word and sentence information input by the user and determine the pause relationships in the generated voice in combination with it. Voice generation can be processed in a computer back end or in the cloud, giving high processing efficiency and fast operation.
Through this embodiment, the server adds word information and syntax information to the voice synthesis scheme, so that the model's text encoder simultaneously encodes information at three granularities: phonemes, words, and syntax. The model thus has richer input information, which helps it better understand the input words at the grammatical and semantic levels together with the corresponding voice information, achieving a voice synthesis effect with more natural pauses. Moreover, the control information based on word segmentation, duration thresholds, and custom phrases makes the pauses controllable, improving both the accuracy of expression in the predicted voice and the naturalness of listening.
It should be understood that, although the steps in the flowcharts related to the above embodiments are sequentially shown as indicated by arrows, these steps are not necessarily sequentially performed in the order indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of steps or a plurality of stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of the steps or stages is not necessarily performed sequentially, but may be performed alternately or alternately with at least some of the other steps or stages.
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 6. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is for storing text sample data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements a training method of a speech model, a speech generating method.
It will be appreciated by those skilled in the art that the structure shown in fig. 6 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, including a memory and a processor, where the memory stores a computer program, and the processor implements the training method and the speech generating method of the speech model when executing the computer program.
In one embodiment, a computer readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the above-described training method, speech generation method of a speech model.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the above-described training method of a speech model, speech generating method.
It should be noted that, user information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the various embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, high density embedded nonvolatile Memory, resistive random access Memory (ReRAM), magnetic random access Memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric Memory (Ferroelectric Random Access Memory, FRAM), phase change Memory (Phase Change Memory, PCM), graphene Memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in the form of a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like. The databases referred to in the various embodiments provided herein may include at least one of relational databases and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processors referred to in the embodiments provided herein may be general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, quantum computing-based data processing logic units, etc., without being limited thereto.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, any combination of these technical features that contains no contradiction should be considered within the scope of this specification.
The above embodiments represent only a few implementations of the present application; their descriptions are specific and detailed, but should not therefore be construed as limiting the scope of the application. It should be noted that those of ordinary skill in the art could make various modifications and improvements without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims (10)

1. A method of training a voice model, the method comprising:
acquiring text sample data and voice features of a target voice corresponding to the text sample data;
acquiring text structure information and phoneme information of the text sample data, inputting the text structure information and the phoneme information into a voice model to be trained, obtaining, by a text coder of the voice model, text coding data according to the text structure information and the phoneme information, and obtaining, by a decoder of the voice model, predicted voice features according to the text coding data;
obtaining voice feature coding data of the voice features through a voice coder, inputting the voice feature coding data into the voice model, and obtaining, by the decoder, reference voice features according to the voice feature coding data; and
adjusting model parameters of the voice model to be trained according to a similarity between the predicted voice features and the reference voice features until a preset training end condition is met, to obtain a trained voice model.
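To make the training procedure of claim 1 concrete, the following is a minimal, illustrative PyTorch sketch of the two encoding branches and the similarity-based parameter update. The module names, layer sizes, and the cosine-similarity loss are assumptions chosen for illustration, not the patented implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextEncoder(nn.Module):
    # Text coder: maps text structure information plus phoneme
    # information to text coding data (claim 1, second step).
    def __init__(self, in_dim=64, hid=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU(), nn.Linear(hid, hid))
    def forward(self, text_struct, phonemes):
        return self.net(torch.cat([text_struct, phonemes], dim=-1))

class SpeechEncoder(nn.Module):
    # Voice coder: maps target voice features to voice feature coding data.
    def __init__(self, feat_dim=80, hid=128):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hid)
    def forward(self, voice_features):
        return self.proj(voice_features)

class Decoder(nn.Module):
    # Shared decoder: produces voice features from either coding.
    def __init__(self, hid=128, feat_dim=80):
        super().__init__()
        self.proj = nn.Linear(hid, feat_dim)
    def forward(self, coding):
        return self.proj(coding)

text_enc, speech_enc, dec = TextEncoder(), SpeechEncoder(), Decoder()
opt = torch.optim.Adam(
    list(text_enc.parameters()) + list(speech_enc.parameters()) + list(dec.parameters()),
    lr=1e-4)

# Dummy batch: 32-dim structure info, 32-dim phoneme info, 80-dim target features.
text_struct, phonemes = torch.randn(8, 32), torch.randn(8, 32)
target_features = torch.randn(8, 80)

predicted = dec(text_enc(text_struct, phonemes))   # text branch
reference = dec(speech_enc(target_features))       # speech branch
# Adjust model parameters according to the similarity between the
# predicted and reference voice features (claim 1, last step).
loss = 1.0 - F.cosine_similarity(predicted, reference, dim=-1).mean()
opt.zero_grad(); loss.backward(); opt.step()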
2. The method of claim 1, wherein obtaining the text structure information of the text sample data comprises:
obtaining word information of the text sample data according to ordering information of each word in the text sample data;
determining syntax information of the text sample data according to word composition information corresponding to the text sample data; and
obtaining the text structure information according to the word information and the syntax information.
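A toy illustration of claim 2: word information is derived from the ordering of the words, and syntax information from their composition. The TOY_POS table is a hypothetical stand-in for whatever syntactic analysis an implementation actually uses.

TOY_POS = {"the": "DET", "cat": "NOUN", "sat": "VERB", "quietly": "ADV"}

def text_structure_info(text):
    words = text.lower().split()
    # Word information: each word paired with its position (ordering info).
    word_info = [(idx, word) for idx, word in enumerate(words)]
    # Syntax information: a word attribute per word, from word composition.
    syntax_info = [TOY_POS.get(word, "UNK") for word in words]
    return {"word_info": word_info, "syntax_info": syntax_info}

print(text_structure_info("The cat sat quietly"))
# {'word_info': [(0, 'the'), (1, 'cat'), (2, 'sat'), (3, 'quietly')],
#  'syntax_info': ['DET', 'NOUN', 'VERB', 'ADV']}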
3. The method of claim 1, wherein the text structure information includes word information and syntax information, and obtaining the text coding data according to the text structure information and the phoneme information comprises:
obtaining first text coding data according to the phoneme information corresponding to each word in the word information;
determining word attributes of each word in the word information according to the syntax information;
obtaining second text coding data according to the word attributes of the words;
determining start-stop time prediction data of the phoneme information according to the text sample data;
and obtaining the text coding data according to the first text coding data, the second text coding data and the start-stop time prediction data of the phoneme information.
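Under the same illustrative assumptions as the claim-1 sketch, the combining step of claim 3 can be pictured as concatenating the two text encodings with the start-stop time predictions; the dimensions below are arbitrary.

import torch

num_words = 8
first_coding = torch.randn(num_words, 32)    # from per-word phoneme information
second_coding = torch.randn(num_words, 16)   # from per-word attributes (syntax information)
start_stop = torch.rand(num_words, 2)        # start-stop time prediction per word's phonemes

# Final text coding data combines all three sources (claim 3, last step).
text_coding = torch.cat([first_coding, second_coding, start_stop], dim=-1)
print(text_coding.shape)  # torch.Size([8, 50])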
4. The method according to claim 1, wherein the method further comprises:
acquiring continuous word and sentence information pre-labeled in the text sample data, wherein a pause time between adjacent words in the continuous word and sentence information is smaller than a preset threshold; and
inputting the continuous word and sentence information into the voice model to be trained, and determining, by the voice model, start-stop time prediction data of each phoneme in the phoneme information according to the continuous word and sentence information.
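A minimal sketch of the pause behavior implied by claim 4, assuming continuous phrases are given as inclusive word-index ranges and pauses are in seconds; the concrete durations and threshold are invented for illustration.

def predict_pauses(words, continuous_phrases, inside_pause=0.05, boundary_pause=0.30):
    # continuous_phrases: list of (first_word_idx, last_word_idx), inclusive.
    inside = set()
    for start, end in continuous_phrases:
        inside.update(range(start, end))  # word gaps strictly inside the phrase
    # Gap i sits between words[i] and words[i + 1]; gaps inside a
    # continuous phrase stay below the threshold, other gaps do not.
    return [inside_pause if i in inside else boundary_pause
            for i in range(len(words) - 1)]

words = ["deep", "learning", "based", "speech", "synthesis"]
print(predict_pauses(words, [(0, 1)]))  # [0.05, 0.3, 0.3, 0.3]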
5. The method of claim 4, wherein acquiring the continuous word and sentence information pre-labeled in the text sample data comprises:
inputting the text sample data into a trained word segmentation model, and obtaining the pre-labeled continuous word and sentence information according to a word segmentation result, corresponding to the text sample data, output by the word segmentation model;
and/or,
obtaining a maximum pause duration of each group of adjacent words in the text sample data, and obtaining the pre-labeled continuous word and sentence information according to adjacent words whose maximum pause duration is smaller than the preset threshold;
and/or,
acquiring preset words and sentences input for the text sample data, and obtaining the pre-labeled continuous word and sentence information according to the preset words and sentences.
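The three alternative sources in claim 5 (a word segmentation model, pause statistics, and user presets) could be sketched as follows; segment() is a hypothetical stand-in for a trained word segmentation model, and the threshold value is illustrative.

def segment(text):
    return text.split()  # toy word segmentation in place of a trained model

def phrases_from_segmentation(text):
    # Branch 1: take the word segmentation result as continuous phrases.
    return segment(text)

def phrases_from_pauses(adjacent_pairs, preset_threshold=0.15):
    # Branch 2: adjacent_pairs is [((w1, w2), max_pause_seconds), ...];
    # keep pairs whose maximum pause duration is below the threshold.
    return [pair for pair, pause in adjacent_pairs if pause < preset_threshold]

def phrases_from_user(preset_phrases):
    # Branch 3: phrases supplied directly for the text sample data.
    return list(preset_phrases)

print(phrases_from_pauses([(("machine", "learning"), 0.05),
                           (("learning", "is"), 0.40)]))
# [('machine', 'learning')]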
6. A speech generation method, the method comprising:
acquiring a target text, and acquiring text structure information and phoneme information of the target text;
inputting the target text into a trained voice model, obtaining, by a text coder of the voice model, text coding data according to the text structure information and the phoneme information of the target text, obtaining, by a decoder of the voice model, predicted voice features according to the text coding data, inputting the predicted voice features into a vocoder in the voice model, and outputting, by the vocoder, corresponding predicted voice according to the predicted voice features, the voice model being trained according to the method of any one of claims 1 to 5; and
obtaining the voice corresponding to the target text according to the predicted voice.
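An inference-side sketch of the pipeline in claim 6, with stand-in linear layers for the text coder and decoder and a toy Vocoder in place of a real neural vocoder; in the claimed method these modules come from the model trained as in claims 1 to 5.

import torch
import torch.nn as nn

class Vocoder(nn.Module):
    # Toy vocoder: maps each 80-dim feature frame to 256 waveform samples.
    def __init__(self, feat_dim=80, hop=256):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hop)
    def forward(self, features):
        return self.proj(features).reshape(-1)  # flatten frames to a 1-D waveform

@torch.no_grad()
def generate(text_coder, decoder, vocoder, text_struct, phonemes):
    coding = text_coder(torch.cat([text_struct, phonemes], dim=-1))  # text coding data
    features = decoder(coding)                                       # predicted voice features
    return vocoder(features)                                         # predicted voice

# Stand-in modules so the sketch runs on its own.
wave = generate(nn.Linear(64, 128), nn.Linear(128, 80), Vocoder(),
                torch.randn(8, 32), torch.randn(8, 32))
print(wave.shape)  # torch.Size([2048])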
7. The method of claim 6, wherein acquiring the target text comprises:
acquiring an original text input by a user;
acquiring continuous word and sentence information in the original text; and
obtaining the target text according to the original text carrying the continuous word and sentence information.
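The step in claim 7 of carrying continuous word and sentence information in the target text could, for example, use inline markers; the <cont> tag syntax below is invented purely for illustration.

def build_target_text(original, phrases):
    # Wrap each continuous phrase so the voice model can read the
    # phrase boundaries directly from the target text.
    target = original
    for phrase in phrases:
        target = target.replace(phrase, "<cont>" + phrase + "</cont>")
    return target

print(build_target_text("text to speech is fun", ["text to speech"]))
# <cont>text to speech</cont> is fun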
8. The method of claim 7, wherein acquiring the continuous word and sentence information in the original text comprises:
inputting the original text into a trained word segmentation model, and obtaining the continuous word and sentence information according to a word segmentation result, corresponding to the original text, output by the word segmentation model;
and/or,
obtaining a maximum pause duration of each group of adjacent words in the original text, and obtaining the continuous word and sentence information according to adjacent words whose maximum pause duration is smaller than a preset threshold;
and/or,
acquiring preset words and sentences input for the original text, and obtaining the continuous word and sentence information according to the preset words and sentences.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 8 when executing the computer program.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 8.
CN202311531840.6A 2023-11-15 2023-11-15 Training method of voice model, voice generating method, equipment and storage medium Pending CN117612512A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311531840.6A CN117612512A (en) 2023-11-15 2023-11-15 Training method of voice model, voice generating method, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311531840.6A CN117612512A (en) 2023-11-15 2023-11-15 Training method of voice model, voice generating method, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117612512A 2024-02-27

Family

ID=89945419

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311531840.6A Pending CN117612512A (en) 2023-11-15 2023-11-15 Training method of voice model, voice generating method, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117612512A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118116363A (en) * 2024-04-26 2024-05-31 厦门蝉羽网络科技有限公司 Speech synthesis method based on time perception position coding and model training method thereof
CN118197277A (en) * 2024-05-15 2024-06-14 国家超级计算天津中心 Speech synthesis method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US20200211529A1 (en) Systems and methods for multi-style speech synthesis
CN110263150B (en) Text generation method, device, computer equipment and storage medium
US9905220B2 (en) Multilingual prosody generation
US20200035209A1 (en) Automatic song generation
US7881928B2 (en) Enhanced linguistic transformation
JP2022531414A (en) End-to-end automatic speech recognition of digit strings
CN117612512A (en) Training method of voice model, voice generating method, equipment and storage medium
WO2018200268A1 (en) Automatic song generation
Mu et al. Review of end-to-end speech synthesis technology based on deep learning
CN114861653B (en) Language generation method, device, equipment and storage medium for virtual interaction
CN114242033A (en) Speech synthesis method, apparatus, device, storage medium and program product
Chen et al. Speech bert embedding for improving prosody in neural tts
US11322133B2 (en) Expressive text-to-speech utilizing contextual word-level style tokens
CN113327574A (en) Speech synthesis method, device, computer equipment and storage medium
JP2022548574A (en) Sequence-Structure Preservation Attention Mechanisms in Sequence Neural Models
CN114974218A (en) Voice conversion model training method and device and voice conversion method and device
JP2023027748A (en) Speech synthesis method, device, apparatus, and computer storage medium
JP2010139745A (en) Recording medium storing statistical pronunciation variation model, automatic voice recognition system, and computer program
CN114242093A (en) Voice tone conversion method and device, computer equipment and storage medium
CN113744713A (en) Speech synthesis method and training method of speech synthesis model
CN113743117A (en) Method and device for entity marking
CN116597807A (en) Speech synthesis method, device, equipment and medium based on multi-scale style
CN116453502A (en) Cross-language speech synthesis method and system based on double-speaker embedding
CN114708848A (en) Method and device for acquiring size of audio and video file
CN114566140A (en) Speech synthesis model training method, speech synthesis method, equipment and product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination