CN113948060A - Network training method, data processing method and related equipment - Google Patents

Network training method, data processing method and related equipment

Info

Publication number
CN113948060A
Authority
CN
China
Prior art keywords
text
network
phoneme
voice
feature
Prior art date
Legal status
Pending
Application number
CN202111058068.1A
Other languages
Chinese (zh)
Inventor
郑念祖
邓利群
王雅圣
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN202111058068.1A
Publication of CN113948060A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27: Speech or voice analysis techniques characterised by the analysis technique

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

Embodiments of the application disclose a network training method, a data processing method and related devices for speech synthesis scenarios. The method in the embodiments of the application includes the following steps: acquiring a first text and a first speech corresponding to the first text; acquiring a first phoneme sequence of the first text; obtaining a correspondence between the first speech and the first phoneme sequence based on an attention mechanism; correcting the correspondence with a dynamic programming method to obtain first duration information of each phoneme in the first phoneme sequence; and training a first prediction network based on the first phoneme sequence and the first duration information to obtain a trained first prediction network, where the trained first prediction network is used to predict duration information of each phoneme in a text to be processed. Because dynamic programming can infer unaligned phonemes by exploiting monotonicity, obtaining the phoneme duration information through an attention mechanism together with a dynamic programming method reduces omission or misplacement of phonemes and improves the perceived quality of the synthesized speech in speech synthesis scenarios.

Description

Network training method, data processing method and related equipment
Technical Field
Embodiments of the application relate to the field of speech synthesis, and in particular to a network training method, a data processing method and related devices.
Background
Artificial Intelligence (AI) is a theory, method, technique and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision and reasoning, human-computer interaction, recommendation and search, AI basic theory, and the like.
At present, with the continuous development of multimedia communication technology, speech synthesis, as one of the important means of human-machine communication, has received wide attention from researchers because of its convenience and speed. In an end-to-end text-to-speech (TTS) scenario, to ensure that the synthesized speech sounds natural, the TTS model needs to correct the correspondence between phonemes and speech during training, so that the speech output by the model is corrected according to that correspondence. A common method for correcting the correspondence between phonemes and speech is forced alignment.
However, with the forced alignment method above, the alignment between phonemes and speech may be poor; in particular, phonemes may be omitted.
Disclosure of Invention
Embodiments of the application provide a network training method, a data processing method and related devices, in which duration information of phonemes is obtained through an attention mechanism and a dynamic programming method, so that omission of phonemes is reduced and the perceived quality of synthesized speech in speech synthesis scenarios is improved.
A first aspect of the embodiments of the present application provides a network training method, which may be performed by a data processing device (e.g., a terminal device or a server), or by a component of the data processing device (e.g., a processor, a chip, or a system-on-chip). The method includes the following steps: acquiring a first text and a first speech corresponding to the first text; acquiring a first phoneme sequence of the first text; obtaining a correspondence between the first speech and the first phoneme sequence based on an attention mechanism, where the correspondence indicates the duration of each phoneme of the first phoneme sequence in the first speech (or, equivalently, the number of frames of the first speech occupied by the phoneme, or the length of time the phoneme occupies in the first speech); correcting the correspondence with a dynamic programming method to obtain first duration information of each phoneme in the first phoneme sequence; and training a first prediction network based on the first phoneme sequence and the first duration information to obtain a trained first prediction network, where the trained first prediction network is used to predict duration information of each phoneme in a text to be processed. The first speech may be speech in a single language/dialect, or speech including at least two languages/dialects, which is not limited herein.
In this embodiment, the first duration information of each phoneme in the first text can be obtained based on an attention mechanism and a dynamic programming method. Because dynamic programming can infer unaligned phonemes by exploiting monotonicity, the probability that a phoneme is estimated incorrectly (for example, misplaced or swallowed) is reduced. Moreover, the trained first prediction network can predict the duration of the phonemes in a text to be processed, which is convenient for scenarios, such as speech synthesis, that require phoneme duration information.
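As an illustrative sketch (not part of the claimed method), assume the dynamic-programming correction yields a hard, binary alignment matrix between mel-spectrogram frames and phonemes; the first duration information is then simply the number of frames assigned to each phoneme. The function name and tensor shapes below are assumptions made for illustration only.

    import torch

    def durations_from_alignment(alignment: torch.Tensor) -> torch.Tensor:
        # `alignment` is assumed to be a binary matrix of shape
        # (num_mel_frames, num_phonemes) in which each frame row contains
        # exactly one 1, i.e. the result of the dynamic-programming correction.
        # The duration of a phoneme is the number of frames assigned to it.
        return alignment.sum(dim=0).long()  # shape: (num_phonemes,)

    # Example: 6 frames aligned to 3 phonemes -> durations [2, 3, 1]
    align = torch.tensor([[1, 0, 0],
                          [1, 0, 0],
                          [0, 1, 0],
                          [0, 1, 0],
                          [0, 1, 0],
                          [0, 0, 1]])
    print(durations_from_alignment(align))  # tensor([2, 3, 1])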
Optionally, in a possible implementation manner of the first aspect, training the first prediction network based on the first phoneme sequence and the first duration information includes: using the first phoneme sequence as the input of the first prediction network, and training the first prediction network with the goal that the value of a first loss function is smaller than a first threshold, to obtain the trained first prediction network, where the first loss function represents the difference between the duration information output by the first prediction network and the first duration information.
In this possible implementation, continuously reducing the difference between the duration information output by the first prediction network and the first duration information improves the accuracy with which the prediction network predicts phoneme duration information.
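The following is a minimal training-step sketch under these assumptions; the predictor architecture, the use of log-durations and the L1 loss are illustrative choices and are not prescribed by the application.

    import torch
    import torch.nn as nn

    class DurationPredictor(nn.Module):
        # Hypothetical duration predictor: phoneme IDs -> per-phoneme log-durations.
        def __init__(self, num_phonemes: int, hidden: int = 256):
            super().__init__()
            self.embed = nn.Embedding(num_phonemes, hidden)
            self.conv = nn.Sequential(
                nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
            )
            self.proj = nn.Linear(hidden, 1)

        def forward(self, phoneme_ids: torch.Tensor) -> torch.Tensor:
            x = self.embed(phoneme_ids)                    # (B, T, H)
            x = self.conv(x.transpose(1, 2)).transpose(1, 2)
            return self.proj(x).squeeze(-1)                # (B, T) predicted log-durations

    predictor = DurationPredictor(num_phonemes=100)
    optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-3)
    loss_fn = nn.L1Loss()                                  # plays the role of the first loss function

    phoneme_ids = torch.randint(0, 100, (2, 12))           # toy first phoneme sequence (batch of 2)
    target_durations = torch.randint(1, 20, (2, 12)).float()  # first duration information from the alignment

    pred = predictor(phoneme_ids)
    loss = loss_fn(pred, torch.log(target_durations))      # gap between predicted and first duration info
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()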
Optionally, in a possible implementation manner of the first aspect, the first speech includes speech in at least two languages/dialects/minority languages; the method further includes: acquiring a second text and a second speech corresponding to the second text, where the second speech includes speech in one of the at least two languages/dialects/minority languages; acquiring a second phoneme sequence of the second text; acquiring second duration information of each phoneme in the second phoneme sequence; and using the second phoneme sequence as the input of a second prediction network, training the second prediction network with the goal that the value of a second loss function is smaller than a second threshold, to obtain the first prediction network, where the second loss function represents the difference between the duration information output by the second prediction network and the second duration information.
In this possible implementation, because a large data set of the speaker's first speech is required, the requirement on the speaker's command of the languages/dialects is high. To solve this problem, before a small mixed data set (the first text and the first speech) is used to train the first prediction network, a large single-language data set (i.e., a data set corresponding to one language/dialect) is obtained, the second prediction network is trained with the single-language data set to obtain the first prediction network, and the first prediction network is then further trained with the mixed data set to obtain the trained first prediction network.
Optionally, in a possible implementation manner of the first aspect, the method further includes: acquiring a first mel-spectrum feature of the first speech; acquiring a first pronunciation feature, where the first pronunciation feature describes the timbre of the first speech; and using the first phoneme sequence, the first duration information and the first pronunciation feature as inputs of a first speech synthesis network, training the first speech synthesis network with the goal that the value of a third loss function is smaller than a third threshold, to obtain a trained first speech synthesis network and a trained first pronunciation feature, where the third loss function represents the difference between a second mel-spectrum feature output by the first speech synthesis network and the first mel-spectrum feature, and the second mel-spectrum feature is obtained after expansion according to the first duration information.
In this possible implementation, the first speech synthesis network can be trained with the first phoneme sequence, the first pronunciation feature and the first duration information, so that the trained first speech synthesis network can perform cross-lingual or cross-dialect speech synthesis.
Optionally, in a possible implementation manner of the first aspect, the first speech synthesis network includes an encoder and an autoregressive decoder; using the first phoneme sequence and the first duration information as inputs of the first speech synthesis network and training the first speech synthesis network with the goal that the value of the third loss function is smaller than the third threshold, to obtain the trained first speech synthesis network, includes: obtaining a first feature corresponding to the first phoneme sequence based on the encoder; expanding the first feature based on the first duration information to obtain a second feature; obtaining the second mel-spectrum feature based on the autoregressive decoder and the second feature; and training the encoder and the autoregressive decoder with the goal that the value of the third loss function is smaller than the third threshold, to obtain the trained first speech synthesis network.
In this possible implementation, because dynamic programming can infer unaligned phonemes by exploiting monotonicity, the probability that a phoneme is estimated incorrectly (e.g., misplaced or swallowed) is reduced. On the one hand, the first feature can be corrected by the first duration information, so that the speech output by the trained first speech synthesis network suffers less from the degraded listening quality caused by misestimated phonemes (such as misplaced or swallowed phonemes). On the other hand, an autoregressive model controlled by the first duration information is established, achieving high naturalness and high robustness of speech synthesis.
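A minimal sketch of the expansion ("length regulation") step described above is given below, assuming the encoder output is one hidden vector per phoneme; the resulting frame-level sequence would then be consumed step by step by the autoregressive decoder to predict mel-spectrum frames. Shapes and names are assumptions.

    import torch

    def expand_by_duration(encoded: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
        # encoded:   (num_phonemes, hidden) -- one vector per phoneme ("first feature")
        # durations: (num_phonemes,)        -- first duration info (frames per phoneme)
        # returns:   (sum(durations), hidden) -- frame-level sequence ("second feature")
        return torch.repeat_interleave(encoded, durations, dim=0)

    encoded = torch.randn(3, 8)                      # 3 phonemes, hidden size 8
    durations = torch.tensor([2, 3, 1])
    frames = expand_by_duration(encoded, durations)  # shape (6, 8)
    print(frames.shape)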
Optionally, in a possible implementation manner of the first aspect, obtaining the second mel-spectrum feature based on the autoregressive decoder and the second feature includes: inputting the second feature into the autoregressive decoder to obtain the second mel-spectrum feature.
In this possible implementation, on the one hand, introducing an encoder and an autoregressive decoder makes the generated mel-spectrum features more accurate. On the other hand, an autoregressive model controlled by the first duration information is established, achieving high naturalness and high robustness of speech synthesis.
Optionally, in a possible implementation manner of the first aspect, obtaining the second mel-spectrum feature based on the autoregressive decoder and the second feature includes: performing convolution processing on the second feature to obtain a third feature; and inputting the third feature into the autoregressive decoder to obtain the second mel-spectrum feature.
In this possible implementation, by introducing the convolution processing that produces the third feature, the third feature captures more complete information; in other words, the third feature takes the whole first phoneme sequence or the whole first text into account.
Optionally, in a possible implementation manner of the first aspect, the first speech includes speech in at least two languages/dialects; the method further includes: acquiring a third text and a third speech corresponding to the third text, where the third speech includes speech in one of the at least two languages/dialects; acquiring a third phoneme sequence of the third text; acquiring third duration information of each phoneme in the third phoneme sequence; acquiring a second pronunciation feature, where the second pronunciation feature describes the timbre of the third speech; acquiring a third mel-spectrum feature of the third speech; and using the third phoneme sequence, the third duration information and the second pronunciation feature as inputs of a second speech synthesis network, training the second speech synthesis network with the goal that the value of a fourth loss function is smaller than a fourth threshold, to obtain the first speech synthesis network and the trained second pronunciation feature, where the fourth loss function represents the difference between a fourth mel-spectrum feature output by the second speech synthesis network and the third mel-spectrum feature, and the fourth mel-spectrum feature is obtained after expansion according to the third duration information.
In this possible implementation, because a large data set of the speaker's first speech is required, the requirement on the speaker's command of the languages/dialects is high. To solve this problem, before a small mixed data set (the first text and the first speech) is used to train the first speech synthesis network, a large single-language data set (i.e., a data set corresponding to one language/dialect) is obtained, the second speech synthesis network is trained with the single-language data set to obtain the first speech synthesis network, and the first speech synthesis network is then further trained with the mixed data set to obtain the trained first speech synthesis network.
Optionally, in a possible implementation manner of the first aspect, the dynamic programming method includes the monotonic alignment search (MAS) method or the Needleman-Wunsch algorithm.
In this possible implementation, by applying the MAS method or the Needleman-Wunsch algorithm to the scenario of correcting the correspondence between speech and phonemes, a correct alignment between phonemes and speech can be obtained, so that speech with good perceived quality can be obtained.
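For illustration, the following is a simplified, MAS-style dynamic program over an attention (log-probability) matrix. It is a toy sketch of the monotonic-alignment idea rather than the exact algorithm of the application, and it assumes the number of frames is at least the number of phonemes.

    import numpy as np

    def monotonic_alignment_search(log_prob: np.ndarray) -> np.ndarray:
        # log_prob: (num_frames, num_phonemes) score of aligning frame t to phoneme j
        #           (e.g. log attention weights). Assumes num_frames >= num_phonemes.
        # Returns a hard alignment (num_frames, num_phonemes): each frame is assigned
        # to exactly one phoneme and the assignment is monotonically non-decreasing.
        T, N = log_prob.shape
        neg_inf = -1e9
        Q = np.full((T, N), neg_inf)              # best cumulative score ending at (t, j)
        Q[0, 0] = log_prob[0, 0]
        for t in range(1, T):
            for j in range(N):
                stay = Q[t - 1, j]
                move = Q[t - 1, j - 1] if j > 0 else neg_inf
                Q[t, j] = log_prob[t, j] + max(stay, move)
        # Backtrack from the last frame and last phoneme.
        alignment = np.zeros((T, N), dtype=np.int64)
        j = N - 1
        for t in range(T - 1, -1, -1):
            alignment[t, j] = 1
            if t > 0 and j > 0 and Q[t - 1, j - 1] > Q[t - 1, j]:
                j -= 1
        return alignment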
A second aspect of the embodiments of the present application provides a data processing method, which may be performed by a data processing device (e.g., a terminal device or a server), or by a component of the data processing device (e.g., a processor, a chip, or a system-on-chip). The method includes the following steps: acquiring a text to be processed; obtaining a phoneme sequence of the text to be processed based on the text to be processed; and predicting duration information of each phoneme in the phoneme sequence based on a trained prediction network, where the trained prediction network is obtained by training based on a first text and first duration information, the first duration information is obtained by correcting, with a dynamic programming method, a correspondence between the first text and a first speech, the first speech is the speech of the first text, and the correspondence represents the duration of each phoneme of the phoneme sequence in the first speech (or, equivalently, the number of frames of the first speech occupied by the phoneme, or the length of time the phoneme occupies in the first speech). The first speech may be speech in a single language/dialect, or speech including at least two languages/dialects, which is not limited herein. Similarly, the text to be processed may be a text corresponding to a single language/dialect, or a text corresponding to at least two languages/dialects.
In this embodiment, the duration information of each phoneme in the phoneme sequence is obtained by a prediction network introduced in the inference stage, and that network is trained on durations obtained by correcting the correspondence between the first text and the first speech with a dynamic programming method, so the problem of inaccurate duration prediction caused by omitted or misplaced phonemes can be avoided.
Optionally, in a possible implementation manner of the second aspect, the method further includes: inputting the phoneme sequence and the duration information into a trained speech synthesis network to obtain a mel-spectrum feature, where the trained speech synthesis network is used to generate speech corresponding to text; and converting the mel-spectrum feature into speech through a vocoder, where the speech is the speaker's speech for the text to be processed.
In this possible implementation, the duration information can be applied to the speech synthesis scenario, so that the generated speech has no omitted or misplaced phonemes, which improves the perceived quality of the synthesized speech.
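An illustrative end-to-end inference sketch is shown below; the component interfaces (duration predictor, synthesis network, vocoder) are assumptions used only to show how the predicted durations flow through the pipeline.

    import torch

    def synthesize(text_to_phonemes, duration_predictor, synthesis_net, vocoder, text: str):
        # All component names are placeholders; only the data flow is illustrated.
        phoneme_ids = text_to_phonemes(text)                  # (1, T) phoneme sequence
        with torch.no_grad():
            log_durations = duration_predictor(phoneme_ids)   # trained prediction network
            durations = torch.clamp(torch.round(torch.exp(log_durations)), min=1).long()
            mel = synthesis_net(phoneme_ids, durations)       # trained speech synthesis network -> mel spectrum
            waveform = vocoder(mel)                           # vocoder converts mel features to speech
        return waveform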
Optionally, in a possible implementation manner of the second aspect, the method further includes: acquiring a pronunciation feature, where the pronunciation feature describes the timbre of a speaker; and inputting the phoneme sequence and the duration information into the trained speech synthesis network to obtain the mel-spectrum feature includes: inputting the phoneme sequence, the duration information and the pronunciation feature into the trained speech synthesis network to obtain the mel-spectrum feature.
In this possible implementation, by introducing the pronunciation feature, the speech generated by the speech synthesis network better matches the timbre corresponding to the pronunciation feature, ensuring that the timbre of the generated speech is consistent with that of the pronunciation feature.
Optionally, in a possible implementation manner of the second aspect, the trained speech synthesis network includes an encoder and an autoregressive decoder; inputting the phoneme sequence, the duration information and the pronunciation feature into the trained speech synthesis network includes: obtaining a first feature corresponding to the phoneme sequence based on the encoder; expanding the first feature based on the duration information to obtain a second feature; obtaining a third feature based on the second feature and the pronunciation feature; and obtaining the mel-spectrum feature based on the autoregressive decoder and the third feature.
In this possible implementation, the prediction network is trained using a dynamic programming method, and because dynamic programming can infer unaligned phonemes by exploiting monotonicity, the duration prediction of the prediction network is improved and the probability that phonemes are swallowed is reduced, so the predicted duration information can be used to correct the speech output by the model.
Optionally, in a possible implementation manner of the second aspect, obtaining the mel-spectrum feature based on the autoregressive decoder and the third feature includes: inputting the third feature into the autoregressive decoder to obtain the mel-spectrum feature.
In this possible implementation, on the one hand, introducing an encoder and an autoregressive decoder makes the generated mel-spectrum features more accurate. On the other hand, an autoregressive model controlled by the duration information is established, achieving high naturalness and high robustness of speech synthesis.
Optionally, in a possible implementation manner of the second aspect, obtaining the mel-spectrum feature based on the autoregressive decoder and the third feature includes: performing convolution processing on the third feature to obtain a fourth feature; and inputting the fourth feature into the autoregressive decoder to obtain the mel-spectrum feature.
In this possible implementation, by introducing the convolution processing of the third feature, the resulting fourth feature captures more complete information; in other words, the fourth feature takes the whole text to be processed into account.
Optionally, in a possible implementation manner of the second aspect, the text to be processed includes texts in at least two languages/dialects, and the first feature is further used to describe the language/dialect to which each phoneme belongs.
This possible implementation can be applied to multi-language/multi-dialect scenarios: duration information of a multi-language/multi-dialect text can be obtained from the prediction network, improving the correctness of the predicted correspondence between phonemes and speech.
Optionally, in a possible implementation manner of the second aspect, the dynamic programming method includes the monotonic alignment search (MAS) method or the Needleman-Wunsch algorithm.
In this possible implementation, by applying the MAS method or the Needleman-Wunsch algorithm to the scenario of correcting the correspondence between speech and phonemes, a correct alignment between phonemes and speech can be obtained, so that speech with good perceived quality can be obtained.
Optionally, in a possible implementation manner of the second aspect, acquiring the text to be processed includes: receiving the text to be processed sent by a terminal device; and the method further includes: sending the speech to the terminal device.
In this possible implementation, the data processing device is a server; placing the network-related part of the inference process in the server saves storage and computing capacity on the terminal device.
A third aspect of the embodiments of the present application provides a data processing device, which may be a terminal device or a server. The data processing device includes: an acquisition unit, configured to acquire a first text and a first speech corresponding to the first text; the acquisition unit is further configured to acquire a first phoneme sequence of the first text; the acquisition unit is further configured to obtain, based on an attention mechanism, a correspondence between the first speech and the first phoneme sequence, where the correspondence indicates the duration of each phoneme of the first phoneme sequence in the first speech (or, equivalently, the number of frames of the first speech occupied by the phoneme, or the length of time the phoneme occupies in the first speech); a correction unit, configured to correct the correspondence based on a dynamic programming method to obtain first duration information of each phoneme in the first phoneme sequence; and a training unit, configured to train a first prediction network based on the first phoneme sequence and the first duration information to obtain a trained first prediction network, where the trained first prediction network is used to predict duration information of each phoneme in a text to be processed. The first speech may be speech in a single language/dialect, or speech including at least two languages/dialects, which is not limited herein.
Optionally, in a possible implementation manner of the third aspect, the training unit is specifically configured to use the first phoneme sequence as the input of the first prediction network and train the first prediction network with the goal that the value of a first loss function is smaller than a first threshold, to obtain the trained first prediction network, where the first loss function represents the difference between the duration information output by the first prediction network and the first duration information.
Optionally, in a possible implementation manner of the third aspect, the first speech includes speech of at least two languages/dialects; the acquiring unit is further used for acquiring a second text and a second voice corresponding to the second text, wherein the second voice comprises a voice of one language/dialect in at least two languages/dialects; the acquisition unit is also used for acquiring a second phoneme sequence of the second text; the acquisition unit is further used for acquiring second duration information of each phoneme in the second phoneme sequence; and the training unit is further configured to train the second prediction network to obtain the first prediction network by taking the second phoneme sequence as an input of the second prediction network and taking a value of the second loss function smaller than a second threshold as a target, where the second loss function is used to represent a difference between the duration information output by the second prediction network and the second duration information.
Optionally, in a possible implementation manner of the third aspect, the acquisition unit is further configured to acquire a first mel-spectrum feature of the first speech; the acquisition unit is further configured to acquire a first pronunciation feature, where the first pronunciation feature describes the timbre of the first speech; and the training unit is further configured to use the first phoneme sequence, the first duration information and the first pronunciation feature as inputs of the first speech synthesis network and train the first speech synthesis network with the goal that the value of a third loss function is smaller than a third threshold, to obtain a trained first speech synthesis network and a trained first pronunciation feature, where the third loss function represents the difference between a second mel-spectrum feature output by the first speech synthesis network and the first mel-spectrum feature, and the second mel-spectrum feature is obtained after expansion according to the first duration information.
Optionally, in a possible implementation manner of the third aspect, the first speech synthesis network includes an encoder and an autoregressive decoder; the training unit is specifically configured to: obtain a first feature corresponding to the first phoneme sequence based on the encoder; expand the first feature based on the first duration information to obtain a second feature; obtain a second mel-spectrum feature based on the autoregressive decoder and the second feature; and train the encoder and the autoregressive decoder with the goal that the value of the third loss function is smaller than the third threshold, to obtain the trained first speech synthesis network.
Optionally, in a possible implementation manner of the third aspect, the training unit is specifically configured to input the second feature into an autoregressive decoder to obtain a second mel-spectrum feature.
Optionally, in a possible implementation manner of the third aspect, the training unit is specifically configured to perform convolution processing on the second feature to obtain a third feature; and the training unit is specifically used for inputting the third characteristic into the autoregressive decoder to obtain a second Mel spectrum characteristic.
Optionally, in a possible implementation manner of the third aspect, the first speech includes speech of at least two languages/dialects; the acquiring unit is further used for acquiring a third text and a third voice corresponding to the third text, wherein the third voice comprises a voice of one language/dialect in at least two languages/dialects; the acquisition unit is further used for acquiring a third phoneme sequence of a third text; the acquisition unit is further used for acquiring third duration information of each phoneme in the third phoneme sequence; the acquisition unit is further used for acquiring a second pronunciation characteristic, and the second pronunciation characteristic is used for describing the tone color characteristic of the third voice; the acquisition unit is also used for acquiring a third Mel spectrum characteristic of a third voice; the training unit is further configured to train the second speech synthesis network to obtain the first speech synthesis network and the trained second pronunciation feature by taking the third phoneme sequence, the third duration information, and the second pronunciation feature as input of the second speech synthesis network, and taking a value of a fourth loss function smaller than a fourth threshold as a target, where the fourth loss function is used to represent a difference between a fourth mel-spectrum feature output by the second speech synthesis network and the third mel-spectrum feature, and the fourth mel-spectrum feature is obtained after being extended by the third duration information.
Optionally, in a possible implementation manner of the third aspect, the dynamic programming method includes the monotonic alignment search (MAS) method or the Needleman-Wunsch algorithm.
A fourth aspect of the embodiments of the present application provides a data processing device, which may be a terminal device or a server. The data processing device includes: an acquisition unit, configured to acquire a text to be processed; the acquisition unit is further configured to obtain a phoneme sequence based on the text to be processed; the acquisition unit is further configured to acquire a pronunciation feature, where the pronunciation feature describes the timbre of a speaker; and a prediction unit, configured to predict duration information of each phoneme in the phoneme sequence based on a trained prediction network, where the trained prediction network is obtained by training based on a first text and first duration information, the first duration information is obtained by correcting, with a dynamic programming method, a correspondence between the first text and a first speech, the first speech is the speech of the first text, and the correspondence represents the duration of each phoneme of the phoneme sequence in the first speech (or, equivalently, the number of frames of the first speech occupied by the phoneme, or the length of time the phoneme occupies in the first speech). The first speech may be speech in a single language/dialect, or speech including at least two languages/dialects, which is not limited herein. Similarly, the text to be processed may be a text corresponding to a single language/dialect, or a text corresponding to at least two languages/dialects.
Optionally, in a possible implementation manner of the fourth aspect, the data processing apparatus further includes: the processing unit is used for inputting the phoneme sequence and the duration information into the trained speech synthesis network to obtain a Mel spectrum characteristic; and the conversion unit is used for converting the Mel spectrum characteristics into voice through the vocoder, wherein the voice is the voice of the speaker to the text to be processed.
Optionally, in a possible implementation manner of the fourth aspect, the obtaining unit is further configured to obtain a pronunciation feature, where the pronunciation feature is used to describe a tone characteristic of a speaker; and the processing unit is specifically used for inputting the phoneme sequence, the duration information and the pronunciation characteristics into the trained speech synthesis network to obtain the Mel spectrum characteristics.
Optionally, in a possible implementation manner of the fourth aspect, the trained speech synthesis network includes an encoder and an autoregressive decoder; the processing unit is specifically used for acquiring a first feature corresponding to the phoneme sequence based on the encoder; the processing unit is specifically used for expanding the first characteristic based on the duration information to obtain a second characteristic; the processing unit is specifically used for acquiring a third feature based on the second feature and the pronunciation feature; and the processing unit is specifically used for obtaining the Mel spectrum characteristic based on the autoregressive decoder and the third characteristic.
Optionally, in a possible implementation manner of the fourth aspect, the processing unit is specifically configured to input the third feature into an autoregressive decoder to obtain the mel-spectrum feature.
Optionally, in a possible implementation manner of the fourth aspect, the processing unit is specifically configured to perform convolution processing on the third feature to obtain a fourth feature; and the processing unit is specifically used for inputting the fourth characteristic into the autoregressive decoder to obtain a Mel spectrum characteristic.
Optionally, in a possible implementation manner of the fourth aspect, the text to be processed includes texts in at least two languages/dialects, and the first feature is further used for describing the language/dialect to which the phoneme belongs.
Optionally, in a possible implementation manner of the fourth aspect, the dynamic programming method includes the monotonic alignment search (MAS) method or the Needleman-Wunsch algorithm.
A fifth aspect of the present application provides a data processing apparatus that performs the method of the first aspect or any possible implementation manner of the first aspect, or performs the method of the second aspect or any possible implementation manner of the second aspect.
A sixth aspect of the present application provides a data processing apparatus comprising: a processor coupled to a memory for storing a program or instructions which, when executed by the processor, cause the data processing apparatus to carry out the method of the first aspect or any possible implementation of the first aspect described above, or the method of the second aspect or any possible implementation of the second aspect described above.
A seventh aspect of the present application provides a computer-readable medium having stored thereon a computer program or instructions which, when run on a computer, causes the computer to perform the method of the aforementioned first aspect or any possible implementation of the first aspect, or causes the computer to perform the method of the aforementioned second aspect or any possible implementation of the second aspect.
An eighth aspect of the present application provides a computer program product which, when executed on a computer, causes the computer to perform the method of the first aspect or any possible implementation manner of the first aspect, or causes the computer to perform the method of the second aspect or any possible implementation manner of the second aspect.
For the technical effects brought by the third aspect, or by the fifth, sixth, seventh and eighth aspects, or by any possible implementation manner thereof, reference may be made to the technical effects brought by the first aspect or its different possible implementation manners, and details are not described here again.
Similarly, for the technical effects brought by the fourth aspect, or by the fifth, sixth, seventh and eighth aspects, or by any possible implementation manner thereof, reference may be made to the technical effects brought by the second aspect or its different possible implementation manners, and details are not described here again.
According to the above technical solutions, the embodiments of the application have the following advantages: the first duration information of each phoneme in the first text can be obtained based on an attention mechanism and a dynamic programming method, and because dynamic programming can infer unaligned phonemes by exploiting monotonicity, the probability that a phoneme is estimated incorrectly (for example, misplaced or swallowed) is reduced, thereby achieving a correct alignment between the phonemes of the first text and the first speech. This is convenient for scenarios, such as speech synthesis, that require phoneme duration information.
Drawings
FIG. 1 is a block diagram of a system architecture provided herein;
FIG. 2 is a schematic diagram of a convolutional neural network structure provided in the present application;
FIG. 3 is a schematic diagram of another convolutional neural network structure provided in the present application;
FIG. 4 is a schematic diagram of a chip hardware structure provided in the present application;
FIG. 5 is a schematic flow chart of a network training method provided in the present application;
FIG. 6 is a schematic diagram of an architecture of an attention network provided herein;
FIG. 7 is a diagram illustrating the alignment between speech and text before correction according to the present application;
FIG. 8 is a diagram illustrating the alignment between speech and text after correction according to the present application;
FIG. 9 is a schematic flow chart of another network training method provided in the present application;
FIG. 10 is a schematic diagram of a speech synthesis network according to the present application;
FIG. 11 is a schematic diagram of another structure of a speech synthesis network provided in the present application;
FIG. 12 is a schematic diagram of a training architecture for a first network and a second network provided herein;
FIG. 13 is a schematic diagram of another training architecture for the first network and the second network provided in the present application;
FIG. 14 is a schematic diagram of a relationship structure between a speech synthesis network and an attention network provided in the present application;
FIG. 15 is a schematic flow chart of a data processing method provided herein;
FIG. 16 is a schematic diagram of another structure of a speech synthesis network provided in the present application;
FIG. 17 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 18 to FIG. 21 are schematic diagrams of several structures of a data processing device in the embodiments of the present application.
Detailed Description
The embodiments of the application provide a network training method, a data processing method and related devices, in which duration information of phonemes is obtained through an attention mechanism and a dynamic programming method, so that omission of phonemes is reduced and the perceived quality of synthesized speech in speech synthesis scenarios is improved.
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
For ease of understanding, the relevant terms and concepts to which the embodiments of the present application relate generally will be described below.
1. Neural network
A neural network may be composed of neural units. A neural unit may be an arithmetic unit that takes $x_s$ and an intercept of 1 as inputs, and its output may be:
$h_{W,b}(x) = f(W^{T}x + b) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$
where $s = 1, 2, \dots, n$, $n$ is a natural number greater than 1, $W_s$ is the weight of $x_s$, and $b$ is the bias of the neural unit. $f$ is the activation function of the neural unit, which introduces a nonlinear characteristic into the neural network to convert the input signal of the neural unit into an output signal. The output signal of the activation function may be used as the input of the next convolutional layer. The activation function may be a sigmoid function. A neural network is a network formed by joining many such single neural units together, i.e. the output of one neural unit may be the input of another neural unit. The input of each neural unit can be connected to the local receptive field of the previous layer to extract features of that local receptive field, and the local receptive field may be a region composed of several neural units.
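The following is a small illustration of a single neural unit as described above, using a sigmoid activation; the input values and weights are arbitrary toy numbers.

    import numpy as np

    def neural_unit(x: np.ndarray, w: np.ndarray, b: float) -> float:
        # Weighted sum of inputs plus bias, passed through a sigmoid activation f.
        z = np.dot(w, x) + b
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([0.5, -1.2, 3.0])   # inputs x_1..x_n
    w = np.array([0.8, 0.1, -0.4])   # weights W_1..W_n
    print(neural_unit(x, w, b=0.2))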
2. Deep neural network
Deep Neural Networks (DNNs), also known as multi-layer neural networks, can be understood as neural networks having many hidden layers, where "many" has no particular metric. From the division of DNNs by the location of different layers, neural networks inside DNNs can be divided into three categories: input layer, hidden layer, output layer. Generally, the first layer is an input layer, the last layer is an output layer, and the middle layers are hidden layers. The layers are all connected, that is, any neuron of the ith layer is necessarily connected with any neuron of the (i + 1) th layer. Of course, the deep neural network may not include the hidden layer, and is not limited herein.
The operation of each layer in a deep neural network can be described mathematically by the expression $\vec{y} = \alpha(W \cdot \vec{x} + \vec{b})$. At the physical level, the work of each layer in the deep neural network can be understood as completing the transformation of an input space into an output space (i.e. the row space into the column space of a matrix) through five operations on the input space (the set of input vectors): 1. raising/lowering dimensions; 2. zooming in/out; 3. rotation; 4. translation; 5. "bending". Operations 1, 2 and 3 are completed by $W \cdot \vec{x}$, operation 4 is completed by $+\vec{b}$, and operation 5 is completed by $\alpha(\cdot)$. The word "space" is used here because the object being classified is not a single thing but a class of things, and space refers to the collection of all individuals of such things. $W$ is a weight vector, and each value in the vector represents the weight of a neuron in that layer of the neural network. The vector $W$ determines the spatial transformation from the input space to the output space described above, i.e. the weight $W$ of each layer controls how the space is transformed. The purpose of training the deep neural network is to finally obtain the weight matrices of all layers of the trained neural network (the weight matrices formed by the vectors $W$ of many layers). Therefore, the training process of a neural network is essentially learning how to control the spatial transformation, and more specifically, learning the weight matrices.
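As a simple illustration of the layer operation $\vec{y} = \alpha(W \cdot \vec{x} + \vec{b})$ described above, the sketch below stacks two such layers with a ReLU nonlinearity; the dimensions and weights are arbitrary toy values.

    import numpy as np

    def relu(x):
        return np.maximum(x, 0.0)

    def dense_layer(x: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
        # One layer: y = alpha(W x + b); W scales/rotates, b translates, alpha "bends".
        return relu(W @ x + b)

    # Input dim 4 -> hidden dim 5 -> output dim 3
    x = np.random.randn(4)
    W1, b1 = np.random.randn(5, 4), np.zeros(5)
    W2, b2 = np.random.randn(3, 5), np.zeros(3)
    y = dense_layer(dense_layer(x, W1, b1), W2, b2)
    print(y.shape)  # (3,)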
3. Convolutional neural network
A Convolutional Neural Network (CNN) is a deep neural network with a convolutional structure. A convolutional neural network contains a feature extractor consisting of convolutional layers and sub-sampling layers. The feature extractor may be regarded as a filter, and the convolution process may be regarded as convolving a trainable filter with an input image or a convolved feature plane (feature map). A convolutional layer is a layer of neurons in the convolutional neural network that performs convolution processing on the input signal. In a convolutional layer, a neuron may be connected to only some of the neurons of the neighbouring layer. A convolutional layer usually contains several feature planes, and each feature plane may be composed of several neural units arranged in a rectangle. Neural units of the same feature plane share weights, and the shared weights are the convolution kernel. Weight sharing can be understood as meaning that the way image information is extracted is independent of location. The underlying principle is that the statistics of one part of an image are the same as those of the other parts, which means that image information learned in one part can also be used in another part, so the same learned image information can be used for all positions on the image. In the same convolutional layer, multiple convolution kernels may be used to extract different image information; generally, the larger the number of convolution kernels, the richer the image information reflected by the convolution operation.
A convolution kernel can be initialized in the form of a matrix of random size, and reasonable weights can be learned during the training of the convolutional neural network. In addition, a direct benefit of weight sharing is reducing the number of connections between the layers of the convolutional neural network while reducing the risk of overfitting. The networks in the embodiments of the present application, such as the separation network, the identification network, the detection network and the depth estimation network, may all be CNNs.
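As a brief illustration of weight sharing in a convolutional layer, the sketch below (using the PyTorch library) applies one set of 3x3 kernels at every position of a toy image, so the parameter count does not depend on the image size; the sizes chosen are arbitrary.

    import torch
    import torch.nn as nn

    # The same 3x3 kernels (shared weights) slide over every spatial position
    # of the input, so the number of parameters is independent of the image size.
    conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
    image = torch.randn(1, 3, 32, 32)      # (batch, channels, height, width)
    feature_map = conv(image)              # (1, 16, 32, 32)
    print(feature_map.shape, sum(p.numel() for p in conv.parameters()))  # 16*3*3*3 + 16 = 448 parameters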
4. Recurrent Neural Network (RNN)
In a traditional neural network model, the layers are fully connected with each other, while the nodes within a layer are not connected to each other. Such an ordinary neural network, however, cannot solve many problems. For example, to predict the next word of a sentence, the preceding words are generally needed, because the words in a sentence are not independent of each other. A Recurrent Neural Network (RNN) means that the current output of a sequence is also related to the previous outputs. Concretely, the network memorizes the previous information, stores it in the internal state of the network, and applies it to the calculation of the current output.
5. Loss function
In the process of training a deep neural network, because the output of the deep neural network is expected to be as close as possible to the value that is really desired, the weight vector of each layer of the network can be updated according to the difference between the predicted value of the current network and the really desired target value (of course, an initialization process is usually performed before the first update, i.e. parameters are pre-configured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to make it lower, and the adjustment continues until the neural network can predict the really desired target value. It is therefore necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the role of the loss function (loss function) or objective function (objective function), which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes a process of reducing this loss as much as possible.
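The following toy training loop illustrates the role of the loss function described above: the loss measures the gap between the predicted and target values, and training repeatedly adjusts the weights to shrink it. The model, loss and optimizer choices are illustrative only.

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 1)
    loss_fn = nn.MSELoss()                   # measures the gap between prediction and target
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    x, target = torch.randn(32, 10), torch.randn(32, 1)
    for step in range(100):                  # training = repeatedly shrinking the loss
        prediction = model(x)
        loss = loss_fn(prediction, target)
        optimizer.zero_grad()
        loss.backward()                      # gradients of the loss w.r.t. the weights
        optimizer.step()                     # adjust weights to reduce the loss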
6. Teacher forcing
Teacher forcing is a network training method that is very important for developing deep-learning language models for machine translation, text summarization, image captioning and many other applications. Instead of using the output produced at the previous time step as the input of the next step, it directly uses the corresponding term of the standard answer (ground truth) of the training data as the input of the next step. It can also be understood as providing the true (expected) output as the input for the next moment.
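The sketch below illustrates teacher forcing in an autoregressive decoding loop; decoder_step is a placeholder for any step function and is an assumption used only for illustration.

    import torch

    def decode(decoder_step, ground_truth_frames: torch.Tensor, teacher_forcing: bool):
        # With teacher forcing, the ground-truth frame from the training data is fed
        # in at each step instead of the frame the model just produced.
        T, dim = ground_truth_frames.shape
        prev = torch.zeros(dim)              # "go" frame
        outputs = []
        for t in range(T):
            pred = decoder_step(prev)
            outputs.append(pred)
            prev = ground_truth_frames[t] if teacher_forcing else pred
        return torch.stack(outputs)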
7. From text to speech
Text To Speech (TTS) is a program or software system that converts text to speech.
8. Vocoder
A vocoder is a sound-signal processing module or piece of software that generates sound waveforms from acoustic features.
9. Autoregressive model
An autoregressive model predicts future features based on the features of past information.
10. Forced alignment
Forced alignment refers to the process of looking up the text of an audio file against a pronunciation dictionary to obtain the time information of each phoneme of each word in the audio.
11. Fundamental frequency
When a sound-producing body produces sound due to vibration, the sound can be generally decomposed into a plurality of pure sine waves, that is, all natural sounds are basically composed of a plurality of sine waves with different frequencies, wherein the sine wave with the lowest frequency is a fundamental tone (i.e., a fundamental frequency, which can be represented by F0), and the other sine waves with higher frequencies are overtones.
12. Prosody
In the field of speech synthesis, prosody broadly refers to features that control the functions of intonation, pitch, accent emphasis, pause, and tempo. Prosody may reflect the emotional state of the speaker or the form of speech, etc.
13. Phoneme
Phoneme (phone): the minimum phonetic unit is divided according to the natural attributes of the speech, and is analyzed according to the pronunciation action in the syllable, and one action forms a phoneme. Phonemes are divided into two major categories, vowels and consonants. For example, the Chinese syllable a (e.g., one sound: o) has only one phoneme, ai (e.g., four sounds: ai) has two phonemes, dai (e.g., one sound: slow) has three phonemes, etc.
14. Word vector (Embedding)
A word vector may also be referred to as "word embedding", "vectorization", "vector mapping", "embedding", and so on. Formally, a word vector represents an object with a dense vector; for example, a phoneme sequence of a text can be represented by vectors.
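As a small illustration, an embedding layer (here PyTorch's nn.Embedding) maps each phoneme ID of a toy phoneme sequence to a dense vector; the vocabulary size and dimension are arbitrary.

    import torch
    import torch.nn as nn

    phoneme_vocab_size, embedding_dim = 80, 256          # illustrative sizes
    embed = nn.Embedding(phoneme_vocab_size, embedding_dim)
    phoneme_sequence = torch.tensor([[12, 41, 7, 3]])    # e.g. phoneme IDs of one sentence
    vectors = embed(phoneme_sequence)                    # shape (1, 4, 256)
    print(vectors.shape)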
At present, in the field of multi-lingual speech synthesis (ML-TTS), which is widely used in daily life, one approach converts the different timbres of other languages into the target timbre through voice conversion (VC) to construct a multi-lingual data set (polyglot speech corpus) and thereby achieve multi-lingual speech synthesis. However, this approach generally adopts an existing single-language end-to-end speech synthesis system and learns the cross-language synthesis capability from the data set; it is limited by the VC conversion effect, so the completeness of the conversion and the sound quality of the target timbre cannot be guaranteed.
To solve the above problems, the present application provides a data processing method that reduces the omission of phonemes by the trained model, thereby improving the listening experience of the multi-lingual speech output by the model.
First, a system architecture provided in the embodiments of the present application is described.
Referring to FIG. 1, an embodiment of the present application provides a system architecture 10. As shown in the system architecture 10, the data collection device 16 is configured to collect training data, which in this embodiment includes training speech and the training text corresponding to the training speech, and to store the training data in the database 13; the training device 12 trains a target model/rule 101 based on the training data maintained in the database 13. How the training device 12 obtains the target model/rule 101 based on the training data will be described in more detail below. The target model/rule 101 can be used to implement the data processing method provided in the embodiments of the present application, that is, after relevant preprocessing, a text is input into the target model/rule 101 to obtain the speaker's speech information for that text. The target model/rule 101 in the embodiments of the present application may specifically be a speech synthesis network. It should be noted that, in practical applications, the training data maintained in the database 13 is not necessarily all collected by the data collection device 16 and may also be received from other devices. It should also be noted that the training device 12 does not necessarily train the target model/rule 101 entirely based on the training data maintained in the database 13; it may also obtain training data from the cloud or elsewhere for model training.
The target model/rule 101 obtained by training according to the training device 12 may be applied to different systems or devices, for example, the execution device 11 shown in fig. 1. The execution device 11 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an augmented reality (AR)/virtual reality (VR) device, or a vehicle-mounted terminal, or may be a server or a cloud. In fig. 1, the execution device 11 is configured with an I/O interface 112 for data interaction with external devices, and a user may input data to the I/O interface 112 through the client device 14. In the embodiment of the present application, the input data may include the first text and a pronunciation feature, where the pronunciation feature is used to describe the timbre characteristics of the speaker. In addition, the input data may be input by the user, may be uploaded by the user through other devices, or may come from a database, which is not limited herein.
The preprocessing module 113 is configured to perform preprocessing according to the first text received by the I/O interface 112, for example, preparations such as converting the first text into phonemes, predicting prosody of the first text, and performing normalization processing on an irregular text.
In the process that the execution device 11 performs preprocessing on the input data or in the process that the calculation module 111 of the execution device 11 performs calculation and the like, the execution device 11 may call data, codes and the like in the data storage system 15 for corresponding processing, and may store data, instructions and the like obtained by corresponding processing into the data storage system 15.
Finally, the I/O interface 112 returns the processing result, such as the speaker's voice information for the first text obtained as described above, to the client device 14 for presentation to the user.
It should be noted that the training device 12 may generate corresponding target models/rules 101 for different targets or different tasks based on different training data, and the corresponding target models/rules 101 may be used to achieve the targets or complete the tasks, so as to provide the user with the required results or provide the input for the subsequent other processes.
In the case shown in fig. 1, the user may manually give the input data, and this may be operated through an interface provided by the I/O interface 112. Alternatively, the client device 14 may automatically send the input data to the I/O interface 112; if the client device 14 is required to obtain the user's authorization before automatically sending the input data, the user may set the corresponding permissions in the client device 14. The user can view the result output by the execution device 11 at the client device 14, and the specific presentation form can be display, sound, action, and the like. The client device 14 may also serve as a data collection terminal, collecting the input data of the I/O interface 112 and the output results of the I/O interface 112 as new sample data, as shown in the figure, and storing the new sample data in the database 13. Of course, the input data input to the I/O interface 112 and the output results of the I/O interface 112 shown in the figure may also be directly stored in the database 13 as new sample data by the I/O interface 112 without being collected by the client device 14.
It should be noted that fig. 1 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the position relationship between the devices, modules, and the like shown in the diagram does not constitute any limitation, for example, in fig. 1, the data storage system 15 is an external memory with respect to the execution device 11, and in other cases, the data storage system 15 may also be disposed in the execution device 11.
As shown in fig. 1, a target model/rule 101 is obtained by training according to the training device 12, where the target model/rule 101 may be a neural network in the embodiment of the present application, and specifically, in the network provided in the embodiment of the present application, the speech synthesis network may be a recurrent neural network, a long-short term memory network, or the like. The prediction network may be a convolutional neural network, a cyclic neural network, or the like.
Optionally, the speech synthesis network and the prediction network in the embodiment of the present application may be two separate networks, or may be a multi-task neural network, where one task is to output durations of phonemes of the text, and the other task is to output speech information corresponding to the text.
Since CNN is a very common neural network, the structure of CNN will be described in detail below with reference to fig. 2. As described in the introduction of the basic concept, the convolutional neural network is a deep neural network with a convolutional structure, and is a deep learning (deep learning) architecture, and the deep learning architecture refers to performing multiple levels of learning at different abstraction levels through a machine learning algorithm. As a deep learning architecture, CNN is a feed-forward artificial neural network in which individual neurons can respond to text input into them.
As shown in fig. 2, Convolutional Neural Network (CNN)100 may include an input layer 110, a convolutional/pooling layer 120, and a neural network layer 130, where the pooling layer is optional.
Convolutional layer/pooling layer 120:
Convolutional layers:
as shown in FIG. 2, convolutional layer/pooling layer 120 may include, for example, 121-126 layers, in one implementation, 121 layers are convolutional layers, 122 layers are pooling layers, 123 layers are convolutional layers, 124 layers are pooling layers, 125 layers are convolutional layers, and 126 layers are pooling layers; in another implementation, 121, 122 are convolutional layers, 123 are pooling layers, 124, 125 are convolutional layers, and 126 are pooling layers. I.e., the output of a convolutional layer may be used as input to a subsequent pooling layer, or may be used as input to another convolutional layer to continue the convolution operation.
Taking convolutional layer 121 as an example, convolutional layer 121 may include a plurality of convolution operators, also called kernels, whose role in phoneme-sequence processing is to act as filters that extract specific information from the phonemes. A convolution operator is essentially a weight matrix, which is usually predefined. During the convolution operation on the phonemes, the weight matrix usually advances one phoneme at a time (or two phonemes at a time, and so on, depending on the value of the stride), so as to extract specific features from the phonemes. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the phonemes, and the weight matrix extends to the entire depth of the phonemes during the convolution operation. Thus, convolving with a single weight matrix produces a convolution output with a single depth dimension, but in most cases a plurality of weight matrices of the same dimension are applied instead of a single one. The outputs of the weight matrices are stacked to form the depth dimension of the convolved phonemes. Different weight matrices can be used to extract different features of the phonemes; for example, one weight matrix is used to extract the language/dialect features of the phonemes, another weight matrix is used to extract the prosody of the phonemes, and so on. Because the weight matrices have the same dimensions, the feature maps extracted by them also have the same dimensions, and the extracted feature maps of the same dimensions are then combined to form the output of the convolution operation.
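For illustration only, the following sketch shows several convolution kernels sliding over a phoneme feature sequence, with their outputs stacked along the channel (depth) dimension; the shapes, stride, and library usage are assumptions made purely for illustration.

```python
# Illustrative sketch (PyTorch): multiple kernels act as filters over a phoneme
# feature sequence, and their outputs form the depth of the convolved output.
import torch
import torch.nn as nn

batch, depth, num_phonemes = 1, 8, 20        # 20 phonemes, each an 8-dim vector
x = torch.randn(batch, depth, num_phonemes)  # channels-first layout expected by Conv1d

conv = nn.Conv1d(in_channels=depth, out_channels=16, kernel_size=3, stride=1, padding=1)
features = conv(x)                           # 16 kernels -> output depth 16
print(features.shape)                        # torch.Size([1, 16, 20])
```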
The weight values in these weight matrices need to be obtained through a large amount of training in practical application, and each weight matrix formed by the trained weight values can extract information from the phoneme sequence, thereby helping the convolutional neural network 100 to make correct prediction.
When the convolutional neural network 100 has multiple convolutional layers, the initial convolutional layer (e.g., 121) tends to extract more general features, which may also be referred to as low-level features. As the depth of the convolutional neural network 100 increases, the later convolutional layers (e.g., 126) extract more complex features, such as features with higher-level semantics; features with higher-level semantics are more applicable to the problem to be solved.
A pooling layer:
Since it is often necessary to reduce the number of training parameters, a pooling layer often needs to be periodically introduced after a convolutional layer. That is, for the layers 121-126 illustrated by 120 in fig. 2, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. In the text processing process (which may also be referred to as phoneme sequence processing), the sole purpose of the pooling layer is to reduce the spatial size of the phoneme representation. The pooling layer may include an average pooling operator and/or a max pooling operator for sampling the phonemes to a smaller size.

The neural network layer 130:
After processing by the convolutional layer/pooling layer 120, the convolutional neural network 100 is not yet able to output the required output information, because, as described above, the convolutional layer/pooling layer 120 only extracts features from the phoneme sequence and reduces the number of parameters. To generate the final output information (class information or other relevant information as needed), the convolutional neural network 100 uses the neural network layer 130 to generate one output or a set of outputs whose number equals the number of required classes. Accordingly, the neural network layer 130 may include a plurality of hidden layers (131, 132 to 13n shown in fig. 2) and an output layer 140. The parameters contained in the hidden layers may be obtained by pre-training on the related training data of a specific task type; for example, the task type may include speech generation for a text, phoneme duration prediction for a text, and the like.
The last layer of the whole convolutional neural network 100, following the hidden layers in the neural network layer 130, is the output layer 140. The output layer 140 has a loss function similar to categorical cross-entropy, which is specifically used for calculating the prediction error. Once the forward propagation of the whole convolutional neural network 100 is completed (i.e., the propagation from 110 to 140 in fig. 2), back propagation (i.e., the propagation from 140 to 110 in fig. 2) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 100 and the error between the result output by the convolutional neural network 100 through the output layer and the ideal result.
It should be noted that the convolutional neural network 100 shown in fig. 2 is only an example of a convolutional neural network, and in a specific application, the convolutional neural network may also exist in the form of other network models, for example, as shown in fig. 3, a plurality of convolutional layers/pooling layers are parallel, and the features extracted respectively are all input to the overall neural network layer 130 for processing.
A hardware structure of a chip provided in an embodiment of the present application is described below.
Fig. 4 shows a hardware structure of a chip provided in an embodiment of the present application, where the chip includes a neural network processor 40. The chip may be provided in the execution device 11 shown in fig. 1 to complete the calculation work of the calculation module 111. The chip may also be provided in the training device 12 shown in fig. 1 to complete the training work of the training device 12 and output the target model/rule 101. The algorithm of each layer in the convolutional neural network shown in fig. 2 can be implemented in the chip shown in fig. 4.
The neural network processor 40 may be any processor suitable for large-scale exclusive or operation processing, such as a neural-Network Processing Unit (NPU), a Tensor Processing Unit (TPU), or a Graphics Processing Unit (GPU). Taking NPU as an example: the neural network processor 40 is mounted as a coprocessor on a main Central Processing Unit (CPU) (host CPU), and tasks are allocated by the main CPU. The core portion of the NPU is an arithmetic circuit 403, and a controller 404 controls the arithmetic circuit 403 to extract data in a memory (weight memory or input memory) and perform an operation.
In some implementations, the arithmetic circuit 403 includes a plurality of processing units (PEs) therein. In some implementations, the operational circuitry 403 is a two-dimensional systolic array. The arithmetic circuit 403 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuitry 403 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to matrix B from the weight memory 402 and buffers it on each PE in the arithmetic circuit. The arithmetic circuit takes the matrix a data from the input memory 401 and performs matrix operation with the matrix B, and partial or final results of the obtained matrix are stored in the accumulator 408.
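For illustration only, the following sketch mimics the accumulation of partial matrix products described above in plain NumPy; the shapes and the rank-1 accumulation order are assumptions used to illustrate the idea of an accumulator, not the actual circuit behavior.

```python
# Rough numerical sketch: data from weight matrix B is reused while input matrix A
# streams through, and partial results are accumulated into the output matrix C.
import numpy as np

A = np.random.randn(4, 3)   # input matrix
B = np.random.randn(3, 5)   # weight matrix

C = np.zeros((4, 5))        # accumulator for partial results
for k in range(A.shape[1]): # accumulate one rank-1 partial product per step
    C += np.outer(A[:, k], B[k, :])

assert np.allclose(C, A @ B)   # final result equals the full matrix product
```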
The vector calculation unit 407 may further process the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, magnitude comparison, and the like. For example, the vector calculation unit 407 may be used for network calculations of non-convolution/non-FC layers in a neural network, such as pooling (pooling), batch normalization (batch normalization), local response normalization (local response normalization), and the like.
In some implementations, the vector calculation unit 407 can store the processed output vector to the unified buffer 406. For example, the vector calculation unit 407 may apply a non-linear function to the output of the arithmetic circuit 403, such as a vector of accumulated values, to generate the activation value. In some implementations, the vector calculation unit 407 generates normalized values, combined values, or both. In some implementations, the vector of processed outputs can be used as activation inputs to the arithmetic circuitry 403, for example for use in subsequent layers in a neural network.
The unified memory 406 is used to store input data as well as output data.
The weight data are transferred by a memory unit access controller (DMAC) 405, which carries the input data in the external memory to the input memory 401 and/or the unified memory 406, stores the weight data in the external memory into the weight memory 402, and stores the data in the unified memory 406 into the external memory.
A Bus Interface Unit (BIU) 410, configured to implement interaction between the main CPU, the DMAC, and the instruction fetch memory 409 through a bus.
An instruction fetch buffer 409 connected to the controller 404 is used for storing instructions used by the controller 404.
The controller 404 is configured to call an instruction cached in the instruction memory 409 to implement controlling of a working process of the operation accelerator.
Generally, the unified memory 406, the input memory 401, the weight memory 402 and the instruction fetch memory 409 are On-Chip (On-Chip) memories, the external memory is a memory outside the NPU, and the external memory may be a double data rate synchronous dynamic random access memory (DDR SDRAM), a High Bandwidth Memory (HBM) or other readable and writable memories.
The operation of each layer in the convolutional neural network shown in fig. 2 or fig. 3 may be performed by the operation circuit 403 or the vector calculation unit 407.
The speech synthesis network, the prediction network training method, and the data processing method according to the embodiments of the present application will be described in detail below with reference to the accompanying drawings.
The speech synthesis neural network and the prediction network in the embodiment of the application may be two separate networks, or may be a multi-task neural network, where one task is to output the duration of each phoneme of the text, and the other task is to output speech information corresponding to the text.
Next, a method for training a speech synthesis network according to an embodiment of the present application will be described in detail with reference to fig. 5. The training method shown in fig. 5 may be performed by a training device of the speech synthesis network, which may be a cloud service device or a terminal device (for example, a device with sufficient computing power, such as a computer or a server), or a system composed of a cloud service device and a terminal device. Illustratively, the training method may be performed by the training device 12 in fig. 1 or the neural network processor 40 in fig. 4.
Optionally, the training method may be processed by a CPU, or may be processed by both the CPU and the GPU, or may use other processors suitable for neural network computation instead of the GPU, which is not limited in this application.
The training method shown in fig. 5 includes steps 501 and 502. Step 501 and step 502 are described in detail below.
Step 501, a first text and a first voice corresponding to the first text are obtained.
In the embodiment of the present application, the first text and the first voice corresponding to the first text (which may also be understood as training data) may be acquired by first acquiring the first voice and then obtaining the first text by recognizing the first voice; the first text and the first voice may also be obtained directly, and the details are not limited herein. The first speech includes speech of at least one language/dialect; for example, the first speech includes speech of one language/dialect, or the first speech includes speech of at least two languages/dialects. If the first speech includes speech of one language/dialect, the embodiment of the present application can be applied to a synthesis scenario of a single language/dialect. If the first speech includes speech of at least two languages/dialects, the embodiment of the present application can be applied to a synthesis scenario of multiple languages/dialects.
In the embodiment of the present application, only the language/dialect is taken as an example for description, it is understood that the first speech may also be speech in other forms such as at least one minority language (minor language), and the specific details are not limited herein.
Optionally, the training data may further comprise dialect identifiers, since the text corresponding to the at least two dialect voices may be the same text.
The language classification in the embodiment of the present application covers various situations: it may refer to the languages of different countries, or to languages divided by different language families or branches, for example the Hindu language family, the Tibetan language family, and the non-Taiyang language family. It may also refer to the six working languages specified by the United Nations: Chinese, English, Russian, Arabic, French, and Spanish. In practical applications, other language classifications are also possible, which are not limited herein.
The dialect classification in the embodiment of the present application also covers various situations, for example: the northern dialect, Wu dialect, Xiang dialect, Hakka dialect, Min dialect, Guangdong dialect, Gannan dialect, and Jin dialect. It may also refer to more specific local dialects, such as the Nanchang, Guangzhou, Changsha, Guangdong, Minnan, Chaoshan, Hakka, northeastern, and Beijing dialects. It is to be understood that a dialect may also be, for example, American English or British English. In practical applications, other dialect classifications are also possible, and the details are not limited herein.
Illustratively, if the first text is a multilingual/dialect text, e.g., the first text includes Chinese and English, the first speech includes the Chinese speech corresponding to the Chinese and the English speech corresponding to the English. If the first text is a monolingual/dialect text, e.g., the first text includes Chinese or English, the first speech includes the Chinese speech corresponding to the Chinese or the English speech corresponding to the English.
Illustratively, the first text is "gunn boy", where the speech corresponding to "gunn" is Chinese speech and the speech corresponding to "boy" is English speech. Of course, the first text may also be in a single language, such as "the weather is very good today", and "the weather is very good today" may also correspond to two dialects (e.g., the northeastern dialect and the Tianjin dialect).
Optionally, if there are multiple speakers (or users), the training data may further include speaker identifiers, or the timbre features of the first speech, or the voiceprint features of the first speech, etc., so that the subsequently predicted voice information is correct.
In the embodiment of the application, the training data may be acquired by directly recording the sound of a sound object (e.g., a speaker), or by inputting audio information and video information by a user, or by receiving the audio information and video information sent by a collection device, and in practical application, the training data may be acquired in other manners, and the acquisition manner of the training data is not limited herein.
Step 502, based on the first text, a first phoneme sequence is obtained.
This step may also be understood as preprocessing of the training data. For example, if the training data described above includes only the first speech, the first text may be obtained by recognizing the first speech, and the first text may then be represented by the first phoneme sequence. If the training data described above includes the first speech and the first text corresponding to the first speech, the preprocessing may include processing the first text to obtain the first phoneme sequence of the first text. The preprocessing may include at least one of: text recognition of the first speech, conversion of the first text into the first phoneme sequence, prosody prediction of the first text, and normalization of the first text. Normalization is performed on the non-normalized information in the first text (for example, the non-normalized text is identified). For example, non-normalized information includes expressions (e.g., smiling, laughing), abbreviations (e.g., 1.4cm), symbols (e.g., colons), and numbers. The phoneme sequence mentioned in the embodiments of the present application can be understood as a phoneme sequence containing prosody. It is to be understood that the manner of predicting the prosody of the first text is not limited herein.
Illustratively, a smiling expression is normalized to obtain the text "smile", and a laughing expression is normalized to obtain the text "haha". "1.4cm" is normalized to obtain the text "1.4 centimeters". "1:3" is normalized to obtain the text "1 to 3".
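For illustration only, the following sketch applies simple rule-based normalization consistent with the examples above; the rules, patterns, and outputs are simplified assumptions, not the actual normalizer of the embodiments.

```python
# Illustrative text-normalization sketch; the rules below are assumptions.
import re

def normalize(text: str) -> str:
    text = text.replace("[smile]", "haha")                             # expression -> word
    text = re.sub(r"(\d+(?:\.\d+)?)\s*cm", r"\1 centimeters", text)    # abbreviation expansion
    text = re.sub(r"(\d+)\s*:\s*(\d+)", r"\1 to \2", text)             # score "1:3" -> "1 to 3"
    return text

print(normalize("the result was 1:3 after a 1.4cm gap [smile]"))
```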
In a possible implementation manner, the first text is a text corresponding to one language/dialect, and the first phoneme sequence may be obtained through a grapheme-to-phoneme (G2P) model corresponding to that language/dialect. The prosody of the first phoneme sequence is obtained through prosody prediction, and the first phoneme sequence is then updated with the prosody.
Illustratively, the first text is "this is a small test", and the first phoneme sequence is "zhe4 shi4 yi2 ge4 xiao3 ce4 shi4", where 1 to 4 represent the tones: 1 indicates the first (level) tone, 2 indicates the second (rising) tone, 3 indicates the third (falling-rising) tone, and 4 indicates the fourth (falling) tone. The prosody of the first text is "this #1 is #1 a #2 small #1 test #1", where #1 and #2 are different prosodic breaks. Correspondingly, the first phoneme sequence of the first text is "sos zh e4 SP1 sh ii4 SP1 i2/g e4 SP2 x iao3 SP1 c e4/sh ii4 SP1 eos", where sos denotes the beginning of the first phoneme sequence and eos denotes the end of the first phoneme sequence. SP1 corresponds to #1 and SP2 corresponds to #2.
In another possible implementation manner, the first text is a text corresponding to two languages/dialects, and the first phoneme sequence may be obtained through G2P models corresponding to the two languages/dialects, respectively. The first text comprises a first sub-text and a second sub-text, the first sub-text is related to one of at least two languages, and the second sub-text is related to another of the at least two languages; alternatively, the first sub-text is associated with one of the at least two dialects and the second sub-text is associated with another of the at least two dialects. The obtaining of the first phoneme sequence of the first text may specifically include: and acquiring a phoneme sequence of the first sub-text, acquiring a phoneme sequence of the second sub-text, and splicing the phoneme sequence of the first sub-text and the phoneme sequence of the second sub-text to obtain a first phoneme sequence of the first text.
Illustratively, the first text is "this one subtest", the first sub-text is "this one subtest", and the first phoneme sequence is "zhe 4 shi4 yi2 ge4 xiao3 ce4 shi 4". The second sub-text is "case" and the second phoneme sequence is "K EY 1S". Wherein 1 to 4 represent tones, for example: 1 indicates flat sound, 2 indicates upward sound, 3 indicates upward sound, and 4 indicates inward sound.
Illustratively, the prosody of the first text is "this #1 is #1 a #2 small #1 test #1 case #3", where #1, #2, and #3 are different prosodic breaks. Correspondingly, the first phoneme sequence of the first text is "sos zh e4 SP1 sh ii4 SP1 i2/g e4 SP2 x iao3 SP1 c e4/sh ii4 SP1 K EY1 S pau eos", where sos denotes the beginning of the first phoneme sequence and eos denotes the end, SP1 corresponds to #1, SP2 corresponds to #2, and pau corresponds to #3.
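For illustration only, the following sketch splices per-language phoneme sequences into one sequence, following the example above; the toy G2P lookups, the break symbol, and the function names are assumptions made purely for illustration.

```python
# Minimal sketch of splicing per-language phoneme sequences; toy G2P stand-ins.
def g2p_zh(text: str) -> list[str]:
    toy = {"这": ["zh", "e4"], "是": ["sh", "ii4"]}
    return [p for ch in text for p in toy.get(ch, [])]

def g2p_en(text: str) -> list[str]:
    toy = {"case": ["K", "EY1", "S"]}
    return toy.get(text.lower(), [])

def build_sequence(sub_texts: list[tuple[str, str]]) -> list[str]:
    seq = ["sos"]
    for lang, sub in sub_texts:                 # splice in the original text order
        seq += g2p_zh(sub) if lang == "zh" else g2p_en(sub)
        seq.append("SP1")                       # prosodic break between sub-texts
    return seq[:-1] + ["eos"]

print(build_sequence([("zh", "这是"), ("en", "case")]))
```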
It can be understood that, if the first sub-text in the first text is segmented by the second sub-text (for example, the first text is "hello, plum, nice to meet you", and the first sub-text "hello, nice to meet you" is segmented by the second sub-text "plum"), the position information of the first sub-text and the second sub-text in the text to be processed needs to be acquired, and the first feature may be obtained by subsequently splicing the first sub-feature and the second sub-feature based on the position information.
Step 503, acquiring a corresponding relation between the first speech and the first phoneme sequence based on the attention mechanism.
The attention mechanism mimics the internal process of biological observation behavior, i.e., a mechanism that aligns internal experience with external perception to increase the fineness of observation of a partial region. An attention mechanism may quickly extract important features of sparse data, described below in connection with an example. The attention mechanism may be a self-attention (self-attention) mechanism or a location sensitive attention (location sensitive attention) mechanism, and the like, and is not limited herein.
This step may also be understood as obtaining the correspondence of the first speech to the first phoneme sequence using an attention network. The correspondence is used to indicate the duration of each phoneme in the phoneme sequence in the first speech (or understood as the number of frames of the phoneme in the first speech, or the duration of the phoneme in the first speech).
Illustratively, the attention network may refer to fig. 6. The attention network includes an encoder (Encoder), an attention module (Attention), an autoregressive decoder (Decoder), a duration extraction module (e.g., MAS), and a prediction network (Duration). In the training process, the first text is taken as the input of the attention network, and the attention network is trained with the goal that the value of the first loss function is smaller than a first threshold, where the first loss function is used to represent the difference between the Mel spectral feature output by the attention network and the actual Mel spectral feature of the first speech. The task of the attention network is to determine the correspondence between the first speech and the first phoneme sequence in the process of predicting the Mel spectral features of the first text.
The encoder is used to encode the first text into a text vector. The autoregressive decoder is used to obtain the speech features corresponding to the text (such as spectral features or Mel spectral features) according to this vector. In the training process of the autoregressive decoder, the real speech feature corresponding to the previous step is taken as a condition of the calculation at each step. In addition, the attention module functions similarly to an extension and is used to focus attention on the appropriate position in the text vector, so that the generated speech better conforms to the characteristics of the speaker. For example, the attention module uses the output of the autoregressive decoder at training time (i.e., the current speech frame output) as a query vector (Query) and computes a location-sensitive attention score with the output of the encoder through an attention recurrent neural network (RNN), thereby obtaining the content information (context vector). In other words, for each time step of the autoregressive decoder, the attention RNN of the autoregressive decoder generates a vector and performs the attention calculation with the output of the encoder, that is, obtains an attention score; based on this score, the autoregressive decoder can determine which part of the encoder output is related to the prediction of the next frame (this part is the content information). Repeating this process determines the correspondence between the first speech and the first phoneme sequence; for details, refer to the principle of autoregressive models. For the description of the duration extraction module (e.g., MAS), refer specifically to step 504.
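For illustration only, the following sketch reads an attention weight matrix as a phoneme-to-frame correspondence; the shapes, the softmax normalization, and the hard argmax reading are assumptions used to illustrate why an uncorrected alignment may skip phonemes.

```python
# Sketch: rows are phonemes, columns are decoder (speech-frame) steps.
import numpy as np

num_phonemes, num_frames = 5, 12
logits = np.random.randn(num_phonemes, num_frames)
P = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)  # softmax over phonemes per frame

hard_choice = P.argmax(axis=0)            # which phoneme each frame attends to most
rough_durations = np.bincount(hard_choice, minlength=num_phonemes)
print(P.shape, rough_durations)           # this raw alignment may leave some phonemes
                                          # with zero frames, which step 504 corrects
```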
Further, for a multilingual/dialect scenario (i.e., the first speech includes at least two languages/dialects), a large data set of the first speech of one speaker is required, which places a high requirement on the speaker's languages/dialects. To solve this problem, before the attention network is trained with the mixed data set (the first text and the first speech), a single data set (i.e., a data set corresponding to one language/dialect) is obtained and used to train the attention network, and the attention network is then further trained with the mixed data set to obtain the trained attention network. The single data set includes a second text and a second speech corresponding to the second text, and the second speech includes speech of one language/dialect.
Illustratively, please refer to fig. 7, the correspondence between the first speech and the first phoneme sequence is obtained through the attention network. The abscissa is the number of decoding steps of the autoregressive Decoder (Decoder) and the ordinate corresponds to the vertical bars of the encoded part of fig. 6. The abscissa can also be understood as speech and the ordinate as phoneme.
In addition, pronunciation features (speaker embedding) may be added to the text vector obtained by the encoder. The pronunciation features may include voiceprint features and the like, and are used to describe the pronunciation characteristics of the speaker or the timbre characteristics of the speaker. The pronunciation features may be obtained by a look-up table (LUP), speaker recognition (SV), or a neural network, and the details are not limited herein.
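For illustration only, the following sketch obtains a pronunciation (speaker) feature from a lookup table and adds it to every position of the encoded text vector; the dimensions and the additive conditioning are assumptions made purely for illustration.

```python
# Illustrative sketch: speaker embedding looked up from a table and broadcast
# over the text vector; shapes are assumed.
import torch
import torch.nn as nn

speaker_table = nn.Embedding(num_embeddings=10, embedding_dim=8)  # one row per speaker
text_vector = torch.randn(1, 20, 8)                               # encoder output: 20 phonemes

speaker_id = torch.tensor([3])
speaker_embedding = speaker_table(speaker_id)                     # (1, 8)
conditioned = text_vector + speaker_embedding.unsqueeze(1)        # broadcast over phonemes
print(conditioned.shape)
```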
Step 504, the corresponding relationship is modified based on the dynamic programming method to obtain the first time length information of each phoneme in the first phoneme sequence.
In order to correctly align the first speech with the first phoneme sequence, the embodiment of the present application may obtain the first duration information (or, understood as the corrected phoneme duration information) of each phoneme in the first phoneme sequence based on a dynamic programming method and the attention mechanism. A dynamic programming method is a method that can modify the correspondence between speech and phonemes, for example, the monotonic alignment search (MAS) algorithm or the Needleman-Wunsch algorithm. Since dynamic programming can infer unaligned phonemes through monotonicity and other constraints, the probability of dropping (swallowing) phonemes can be reduced.
The corresponding relationship between the first speech and the phoneme is obtained through step 503, and the corresponding relationship is modified according to a dynamic programming method, so as to obtain the first time length information of each phoneme in the first phoneme sequence.
Optionally, a duration extraction module (e.g., MAS) in the attention network of fig. 6 is specifically configured to modify the correspondence between the first speech and the first phoneme sequence.
For example, the correspondence between each phoneme in the first phoneme sequence and the first speech obtained by the attention network is a matrix P with M rows and N columns, where M is the dimension of the first text (e.g., the ordinate 70 in fig. 7), N is the dimension of the spectrogram of the first speech (e.g., the abscissa 120 in fig. 7), and P(i, j) can be understood as the score of the i-th phoneme (ordinate) against the j-th speech frame (abscissa), i.e., a point on the curve of fig. 7. This step may specifically include steps 1 to 3.
Step 1: initialize the Q matrix and the A matrix (the A matrix can be understood as an all-zero matrix).
Step 2: starting from the first point at the lower-left corner of the diagonal of fig. 7, fill the Q matrix iteratively according to monotonicity (see formula one).
Step 3: starting from the last point of the Q matrix, which corresponds to the upper-right corner of the diagonal of fig. 7, update the matrix A according to formula two, that is, set the element of A at each position on the backtracked path to 1. A can be understood as the resulting optimal path, i.e., the values of the ordinate corresponding to the abscissa 75-85 in fig. 7. Fig. 7 is then corrected according to these values to obtain the correspondence between the first speech and the phonemes shown in fig. 8; summing A along the abscissa yields a vector whose dimension equals that of the ordinate, and rounding this vector up/down gives the first duration information.
Formula one: Q(i, j) = max(Q(i-1, j-1), Q(i, j-1)) + P(i, j)
Formula two: i*(N) = M; i*(j-1) = argmax over i in {i*(j)-1, i*(j)} of Q(i, j-1); and A(i*(j), j) = 1 for every column j.
Formula one can be understood as an evaluation in the increasing direction: the value of Q(i, j) accumulates the larger of Q(i, j-1) (the point to its left) and Q(i-1, j-1) (the point to its lower left). Formula two can be understood as an evaluation in the decreasing direction: starting from the upper-right corner, the position selected in each column is determined by the position already selected in the column to its right, and the corresponding element of A is set to 1.
It should be understood that the above formula is only an example, and in practical applications, there may be other forms of formula, and the specific details are not limited herein.
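For illustration only, the following sketch implements a simplified reading of formula one (forward accumulation) and formula two (backtracking), under the assumption that P[i, j] scores phoneme i against speech frame j; the function name and the duration extraction by row-wise summation are illustrative assumptions, not a definitive implementation of the embodiments.

```python
# Simplified dynamic-programming correction sketch (forward pass + backtracking).
import numpy as np

def correct_alignment(P: np.ndarray) -> np.ndarray:
    M, N = P.shape                      # M phonemes, N speech frames
    Q = np.full((M, N), -np.inf)
    Q[0, 0] = P[0, 0]
    for j in range(1, N):               # formula one: Q[i,j] = max(Q[i-1,j-1], Q[i,j-1]) + P[i,j]
        for i in range(min(j + 1, M)):
            best_prev = Q[i, j - 1]
            if i > 0:
                best_prev = max(best_prev, Q[i - 1, j - 1])
            Q[i, j] = best_prev + P[i, j]

    A = np.zeros((M, N), dtype=int)     # formula two: trace back from the upper-right corner
    i = M - 1
    for j in range(N - 1, -1, -1):
        A[i, j] = 1
        if j > 0 and i > 0 and (i == j or Q[i - 1, j - 1] > Q[i, j - 1]):
            i -= 1
    return A

P = np.random.rand(5, 12)
A = correct_alignment(P)
durations = A.sum(axis=1)               # frames per phoneme -> first duration information
print(durations, durations.sum())       # durations sum to the number of frames
```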
Step 505, training a first prediction network based on the first phoneme sequence and the first time length information to obtain a trained first prediction network.
The first prediction network is trained by taking the first phoneme sequence as the input of the first prediction network and taking as the goal that the value of the first loss function is smaller than a first threshold, so as to obtain the trained first prediction network, where the first loss function is used to represent the difference between the duration information output by the first prediction network and the first duration information. The trained first prediction network is used to predict the duration information of a text to be processed. It is to be understood that the embodiments of the present application do not limit the specific form of the loss functions involved.
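For illustration only, the following sketch shows one training step of a toy duration predictor against the first duration information; the architecture, the use of mean squared error on log-durations, and all hyper-parameters are illustrative assumptions rather than the prediction network of the embodiments.

```python
# Minimal training-step sketch for step 505, assuming a toy convolutional predictor.
import torch
import torch.nn as nn

class DurationPredictor(nn.Module):
    def __init__(self, vocab_size: int = 64, dim: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.proj = nn.Linear(dim, 1)

    def forward(self, phoneme_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(phoneme_ids).transpose(1, 2)          # (B, dim, T)
        x = torch.relu(self.conv(x)).transpose(1, 2)         # (B, T, dim)
        return self.proj(x).squeeze(-1)                      # predicted log-duration per phoneme

model = DurationPredictor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                                       # the loss form here is assumed

phoneme_ids = torch.randint(0, 64, (2, 10))                  # first phoneme sequence (toy)
first_durations = torch.randint(1, 20, (2, 10)).float()      # first duration information (frames)

pred = model(phoneme_ids)
loss = loss_fn(pred, torch.log(first_durations))             # train until the loss is below a threshold
loss.backward()
optimizer.step()
```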
Optionally, for a scenario in which the first speech includes speech of at least two languages/dialects, a large data set of the first speech of one speaker is required, which places a high requirement on the speaker's languages/dialects. To solve this problem, before the first prediction network is trained with the mixed data set (the first text and the first speech), a single data set (i.e., a data set corresponding to one language/dialect) is obtained and used to train a second prediction network to obtain the first prediction network, and the mixed data set is then used to further train the first prediction network to obtain the trained first prediction network. The single data set includes a second text and a second speech corresponding to the second text, and the second speech includes speech of one language/dialect. That is, if the method is applied to a speech scenario of two languages/dialects, this embodiment further includes: acquiring the second text and the second speech corresponding to the second text; acquiring a second phoneme sequence of the second text; acquiring second duration information of each phoneme in the second phoneme sequence; and taking the second phoneme sequence as the input of the second prediction network and training the second prediction network with the goal that the value of the second loss function is smaller than a second threshold, so as to obtain the first prediction network. The second loss function is used to represent the difference between the duration information output by the second prediction network and the second duration information.
In the embodiment of the application, the first duration information of each phoneme in the first text can be obtained based on the attention mechanism and a dynamic programming method. Since dynamic programming can infer unaligned phonemes through monotonicity and other constraints, the probability that a phoneme is estimated incorrectly (for example, misplaced or swallowed) can be reduced, thereby achieving correct alignment between the phonemes in the first text and the first speech. This facilitates application to scenarios such as speech synthesis that require phoneme duration information.
The following description will be made taking a speech synthesis scenario as an example.
In this scenario, referring to fig. 9, in addition to the above steps 501 to 504, the embodiment may further include steps 901 to 903.
Step 901, a first mel spectrum feature of a first voice is obtained.
In this step, a first mel-spectrum feature of the first speech is obtained based on the first speech in the training data, and the first mel-spectrum feature can also be understood as an actual mel-spectrum feature of the first speech or the first text. The manner in which the first mel-frequency spectrum is obtained is not limited herein.
Step 902, obtain a first pronunciation feature. This step is optional.
The first pronunciation feature may include a voiceprint feature and the like, and is used to describe the pronunciation characteristics of the speaker or the timbre characteristics of the speaker. The manner of obtaining the first pronunciation feature is similar to that described above; it may be obtained by means of an LUP, SV, or a neural network, and is not limited herein.
Of course, the first pronunciation feature may also be initialized, and the trained first pronunciation feature is obtained through the subsequent training.
Step 903, taking the first phoneme sequence and the first time length information as input of the first speech synthesis network, and training the first speech synthesis network by taking the value of the third loss function smaller than the third threshold as a target to obtain the trained first speech synthesis network and the trained first pronunciation characteristic.
Optionally, if the present embodiment includes step 902, this step may specifically include: and training the first voice synthesis network by taking the first phoneme sequence, the first time length information and the first pronunciation characteristics as input of the first voice synthesis network and taking the value of the third loss function smaller than the third threshold value as a target to obtain the trained first voice synthesis network and the trained first pronunciation characteristics. The following description is given by way of example only, and it is understood that step 902 may not be included.
The third loss function is used to represent a difference between a second mel-frequency spectrum feature outputted by the first speech synthesis network and the first mel-frequency spectrum feature, where the second mel-frequency spectrum feature is obtained after the first time length information is extended (described below with reference to fig. 10 and 11).
In the embodiment of the present application, there are various cases of training the first speech synthesis network, which are described below:
first, the first speech synthesis network includes an attention network.
In this manner, the attention network in step 503 may be understood to be the second network in the first speech synthesis network.
Optionally, the architecture of a speech synthesis network (e.g., the first speech synthesis network or the subsequent second speech synthesis network) may refer to fig. 10, and the speech synthesis network may specifically include the first network and/or the second network. The first network includes a first encoder (EncoderA) and a first autoregressive decoder (DecoderA). The second network includes a second encoder (EncoderB), an attention module, and a second autoregressive decoder (DecoderB). The first encoder/second encoder is used to encode the first phoneme sequence into a first feature. The first autoregressive decoder/second autoregressive decoder is used to obtain the speech feature corresponding to the first text (such as a spectral feature or the second Mel spectral feature) according to the first feature (or the second feature expanded by the extension module). In the training process of the first autoregressive decoder/second autoregressive decoder, the real speech feature corresponding to the previous step is taken as a condition of the calculation at each step. The second network may be understood as the aforementioned attention network. The attention module in the second network is used to derive the first duration information of the phonemes used by the extension module in the first network.
Further, in order to ensure the consistency of the preceding and following speech, the first network may further include an extension module, which is configured to expand the first feature according to the phoneme duration information generated based on the attention module (or the prediction network described above). That is, it can be understood as up-sampling the text vector according to the duration of each phoneme in the first speech (or expanding the number of frames of the vector) to obtain a vector with the corresponding number of frames. The first autoregressive decoder is specifically configured to decode the expanded vector to obtain the Mel spectral feature, and further obtain the speech of the speaker for the first text.
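For illustration only, the following sketch shows the kind of expansion described above: each phoneme's encoded vector is repeated according to its duration so that the text feature is up-sampled to the length of the speech. The function name, shapes, and durations are assumptions made purely for illustration.

```python
# Rough sketch of expanding the first feature by per-phoneme durations.
import torch

def extend(first_feature: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    # first_feature: (num_phonemes, dim); durations: (num_phonemes,) in frames
    return torch.repeat_interleave(first_feature, durations, dim=0)

first_feature = torch.randn(4, 8)                 # 4 phonemes encoded by the first encoder
durations = torch.tensor([3, 1, 4, 2])            # first duration information
second_feature = extend(first_feature, durations) # (10, 8): one vector per speech frame
print(second_feature.shape)
```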
Optionally, the architecture of the speech synthesis network may be specifically as shown in fig. 11. The first network includes a first encoder (EncoderA), an extension module, and a first autoregressive decoder (DecoderA). The second network includes a second encoder (EncoderB), an attention module (Attention), a duration extraction module (MAS), and a second autoregressive decoder (DecoderB). The state extension module includes an extension sub-module (State extension) and a duration module (Duration). In the training process, the correspondence between the first phoneme sequence and the first speech is corrected by the MAS of the second network, and the corrected duration information of each phoneme (i.e., the first duration information) is obtained (for a specific description, refer to the steps shown in fig. 5). The first duration information is transmitted to the Duration module in the state extension module. On one hand, the Duration module sends the first duration information to the extension sub-module, and the extension sub-module uses the first duration information to expand the first feature output by the first encoder to obtain the second feature. On another hand, the Duration module sends the first duration information to a duration prediction module (Duration Predictor), and the training of the duration prediction module is then realized by using the first duration information and the first phoneme sequence; the duration prediction module may be understood as the prediction network in step 505 above. On yet another hand, the Duration module sends the first duration information to the second autoregressive decoder, which can be understood as training the second network with the difference between the phoneme durations obtained from the attention alignment information and the first duration information. The prediction network can thus be understood as being trained according to the duration information output by the attention network.
It is understood that the first pronunciation feature may also be added after the attention module processes the text vector. The second autoregressive decoder is specifically configured to decode the expanded vector to obtain the speech of the speaker for the text.
Further, the state extension module may further include a convolution module (Fusion). The convolution module is configured to perform convolution processing on the second feature expanded by the extension sub-module and the first pronunciation feature to obtain a third feature, and the third feature is then input into the first autoregressive decoder to obtain the second Mel spectral feature. In this way, the obtained second Mel spectral feature is more complete, or it can be understood that the second Mel spectral feature takes the features of the entire first phoneme sequence into account.
Alternatively, the first encoder, the second encoder, the first autoregressive decoder, and the second autoregressive decoder may be a recurrent neural network, a long-short-term memory (LSTM), or the like.
Illustratively, the second network may include a Transformer, Tacotron 2, and the like.
Optionally, the decoders (the first autoregressive decoder and the second autoregressive decoder) may be unidirectional decoders or bidirectional decoders (i.e., the two directions run in parallel), and the details are not limited herein. The two directions refer to the directions of the training text, which may also be understood as the directions of the vector corresponding to the first phoneme sequence, or as the forward order and the reverse order of the first phoneme sequence: one direction points from one side of the first phoneme sequence to the other side, and the other direction points from the other side back to the first side. If the decoder is a bidirectional decoder, the decoders in the two directions (forward and reverse order) are trained in parallel and computed independently during training, so there is no dependence between their results. Of course, if the speech synthesis network includes a prediction network, the prediction network may be referred to as the duration prediction module.
Illustratively, if the first text is: "Gunn boy", the first direction or positive order may be the direction from "Gunn" to "y", and the second direction or negative order may be the direction from "y" to "Gunn".
Further, similar to the aforementioned training of the attention network, for a multilingual/dialect scenario, a large data set of the first speech of one speaker is required, which places a high requirement on the speaker's languages/dialects. To solve this problem, before the first speech synthesis network is trained with the mixed data set (the first text and the first speech), a single data set (i.e., a data set corresponding to one language/dialect) is obtained and used to train a second speech synthesis network to obtain the first speech synthesis network, and the mixed data set is then used to further train the first speech synthesis network to obtain the trained first speech synthesis network. The single data set includes a third text and a third speech corresponding to the third text, and the third speech includes speech of one language/dialect. The third text may also be the aforementioned second text, and the third speech may also be the aforementioned second speech.
If the first speech includes at least two types of languages/dialects, the training process of the first speech synthesis network may be understood as: acquiring a third text and a third voice corresponding to the third text, wherein the third voice comprises a voice of one language/dialect in at least two languages/dialects; acquiring a third phoneme sequence of a third text; acquiring third duration information of each phoneme in the third phoneme sequence; acquiring a second pronunciation characteristic, wherein the second pronunciation characteristic is used for describing the tone color characteristic of the third voice; acquiring a third Mel spectral feature of a third voice; and training the second speech synthesis network by taking the third phoneme sequence, the third duration information and the second pronunciation characteristic as input of the second speech synthesis network and taking the value of a fourth loss function smaller than a fourth threshold value as a target to obtain the first speech synthesis network and the trained second pronunciation characteristic, wherein the fourth loss function is used for representing the difference between a fourth Mel spectrum characteristic and a third Mel spectrum characteristic output by the second speech synthesis network, and the fourth Mel spectrum characteristic is obtained after the third duration information is expanded.
For example, the training of the first network and the second network may refer to fig. 11 and 12. It can be seen that, whether training on the multi-language/dialect mixed data set or on the single-language/dialect data set, the correspondence between the first speech and the phonemes obtained through the second network is processed by the MAS and then provided to the first network, specifically to the extension module in the first network, so as to align the correspondence between the first speech and the phonemes and reduce the omission of phonemes in the speech output by the speech synthesis network.
Second, the attention network is not in the speech synthesis network.
In this manner, the attention network in step 503 may be understood as a network other than the speech synthesis network.
Alternatively, the architecture of the speech synthesis network may refer to fig. 13, the speech synthesis network may be specifically understood as the first network in the foregoing description, and the description of the first network may refer to the foregoing description, which is not limited herein.
It should be noted that the training processes shown in the prediction network and the speech synthesis network may also adopt other training methods instead of the aforementioned training method, and are not limited herein.
In a possible implementation manner, the network training method in the embodiment of the present application includes steps 501 to 505, that is, a training method of the prediction network. In another possible implementation manner, the network training method in the embodiment of the present application may include steps 501 to 505 and steps 901 and 903, that is, a training method of the prediction network and the speech synthesis network, where the speech synthesis network is used to predict the speech of the text to be processed. In another possible implementation manner, the network training method in the embodiment of the present application may include steps 501 to 505 and steps 901 to 903, that is, a training method of the prediction network and the speech synthesis network, where the speech synthesis network is used to predict the speech of the speaker for the text to be processed. The prediction network and the speech synthesis network provided in the embodiment of the present application may be applied to a single-language/dialect scenario, and may also be applied to a multi-language/dialect scenario, and the like, which is not limited herein.
In this embodiment, the prediction network trained by means of the attention mechanism and the dynamic programming method can accurately predict the duration information of the phonemes, or, in other words, a correct correspondence between the phonemes and the speech can be obtained, thereby improving the listening quality of the speech obtained by the speech synthesis network and reducing the degradation of the listening quality caused by missing phonemes.
First, application scenarios to which the data processing method provided in the embodiment of the present application is applicable are described. The data processing method can be applied to voice interaction scenarios, including smart home, smart travel, smart government affairs, and other scenarios, which are not limited herein. For example, the data processing method can be applied to a reading APP that uses speech synthesis technology, providing the user with a highly natural mixed-language reading function and an excellent experience. For another example, the data processing method can also be applied to devices such as mobile phones and smart speakers to broadcast the latest information to users anytime and anywhere. For another example, the data processing method can also be applied to scenarios such as taxi-hailing software and restaurant number-calling and queuing software, where orders are broadcast through speech synthesis so that notification information can be obtained conveniently. It can be understood that the data processing method can also be applied to intelligent hardware such as children's story machines, intelligent robots, and tablet devices, so that the interaction between the user and the device is more natural and friendly. It should be understood that the foregoing application scenarios are only examples; in practical applications there are other application scenarios, and the details are not limited herein.
The data processing device is a terminal device serving the user, or a cloud device. The terminal device may include a head mounted display (HMD), which may be a combination of a virtual reality (VR) box and a terminal, a VR all-in-one machine, an augmented reality (AR) device, or a mixed reality (MR) device, and may further include a cellular phone, a smart phone, a personal digital assistant (PDA), a tablet computer, a laptop computer, a personal computer (PC), a vehicle-mounted terminal, a projector, a smart screen, a robot, and the like, which are not limited herein.
The data processing method provided by the embodiment of the application can be executed by the terminal device or the server independently, or can be completed by the terminal device and the server together, which are respectively described as follows:
the first embodiment is as follows: the terminal device or the server individually executes the data processing method.
Referring to fig. 15, an embodiment of a data processing method provided in this embodiment of the present application may be executed by a data processing device, or may be executed by a component (e.g., a processor, a chip, or a system-on-chip) of a data processing device, where the data processing device may be a terminal device or a server, and the embodiment includes steps 1501 to 1506.
Step 1501, acquiring a text to be processed.
In the embodiment of the present application, there are various ways for the data processing device to acquire the text to be processed, which may be a way of acquiring a text input by a user, a way of receiving a text to be processed sent by another device, a way of selecting a text to be processed from a database, and the like, and the specific details are not limited herein.
In this embodiment of the application, the text to be processed includes a text of a single language/dialect, or the text to be processed includes a text of multiple languages/dialects, where the description about the languages/dialects may refer to the description in the embodiment shown in fig. 5, and details are not repeated here.
Step 1502, based on the text to be processed, a phoneme sequence is obtained.
In this step, the phoneme sequence obtained based on the text to be processed is similar to the manner of obtaining the first phoneme sequence based on the first text in fig. 5, which is not described herein again.
Step 1503, acquiring pronunciation characteristics. This step is optional.
Optionally, the pronunciation feature may include a voiceprint feature or the like that describes the pronunciation characteristics of the speaker, or the timbre characteristics of the speaker. The manner of obtaining the pronunciation feature is similar to the manner of obtaining the first pronunciation feature in step 503, and it may be obtained by LUP, SV or a neural network, which is not limited herein.
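For illustration only, the sketch below shows how a pronunciation feature could be taken from a learnable lookup table indexed by speaker. The table size, embedding dimension and speaker id are assumptions made for this example and are not specified by this application.

```python
# A minimal sketch, assuming a lookup-table speaker embedding (sizes are illustrative).
import numpy as np

rng = np.random.default_rng(0)
NUM_SPEAKERS, EMB_DIM = 8, 256                      # assumed sizes
speaker_table = rng.normal(size=(NUM_SPEAKERS, EMB_DIM))  # learned jointly with the networks in practice

def get_pronunciation_feature(speaker_id: int) -> np.ndarray:
    """Return the timbre (pronunciation) embedding of one speaker."""
    return speaker_table[speaker_id]

feat = get_pronunciation_feature(3)                 # shape (256,)
```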
Step 1504, predicting duration information of each phoneme in the phoneme sequence based on the trained prediction network.
The step may specifically be inputting the phoneme sequence into a trained prediction network to obtain duration information of each phoneme in the phoneme sequence. The trained prediction network is obtained by training based on a first text and first time length information, the first time length information is obtained by modifying the corresponding relation between the first text and a first voice through a dynamic programming method, the first voice is the voice of the first text, and the corresponding relation is used for expressing the time length of each phoneme in the phoneme sequence in the first voice.
For specific description of the trained prediction network, reference may be made to the description of the first prediction network in the embodiment shown in fig. 5, and details are not repeated here.
Illustratively, the duration information of each phoneme is (1, 2, 0): 1 represents the expansion multiple of the first phoneme, 2 represents the expansion multiple of the second phoneme, and 0 represents that the third phoneme is not expanded.
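As a hedged illustration of step 1504, the sketch below feeds a phoneme sequence through a small duration predictor. The two-layer architecture, vocabulary size and phoneme ids are assumptions made only to show the input/output shapes; they are not the prediction network of this application.

```python
# A minimal sketch, assuming an embedding + MLP duration predictor (not the network of this application).
import torch
import torch.nn as nn

class DurationPredictor(nn.Module):
    def __init__(self, num_phonemes=100, emb_dim=64, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(num_phonemes, emb_dim)
        self.net = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, phoneme_ids):                   # (batch, seq_len) phoneme ids
        x = self.emb(phoneme_ids)                     # (batch, seq_len, emb_dim)
        log_dur = self.net(x).squeeze(-1)             # (batch, seq_len)
        return torch.clamp(torch.round(torch.exp(log_dur)), min=0)  # per-phoneme frame counts

predictor = DurationPredictor()
phonemes = torch.tensor([[4, 17, 23]])                # assumed phoneme ids
durations = predictor(phonemes)                       # after training, e.g. (1, 2, 0)
```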
Step 1505, input the phoneme sequence, duration information and pronunciation features into the trained speech synthesis network to obtain the Mel spectrum features. This step is optional.
The step may specifically be inputting the phoneme sequence, the duration information, and the pronunciation characteristics into a trained speech synthesis network to obtain the mel-frequency spectrum characteristics.
Optionally, the trained speech synthesis network may be the trained first speech synthesis network in the embodiment shown in fig. 9, and details are not described here again.
Alternatively, the trained speech synthesis network may be as shown in fig. 16, and includes an encoder, an extension module and an autoregressive decoder. The method specifically includes: acquiring a first feature corresponding to the phoneme sequence based on the encoder; expanding the first feature through the extension module based on the duration information (obtained by the prediction network) to obtain a second feature; acquiring a third feature based on the second feature and the pronunciation feature; and obtaining the Mel spectrum feature based on the autoregressive decoder and the third feature. Obtaining the Mel spectrum feature based on the third feature and the autoregressive decoder may specifically be inputting the third feature into the autoregressive decoder, or may be performing convolution processing on the third feature through a convolution module (Fusion) to obtain a fourth feature and then inputting the fourth feature into the autoregressive decoder. The speech synthesis network shown in fig. 16 may also be understood as the first network described above in the embodiment of fig. 5, which does not include a prediction network.
It is to be understood that the encoder, the extension module, and the autoregressive decoder in this embodiment may refer to the descriptions in fig. 6, fig. 11, or fig. 14, and are not described herein again.
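The following sketch illustrates the forward pass just described (encoder, expansion by duration, fusion with the pronunciation feature, optional convolution, autoregressive decoding into a Mel spectrum). All layer types and sizes below are assumptions made to keep the example runnable; the actual encoder, extension module and autoregressive decoder are those described with reference to fig. 6, fig. 11 or fig. 14.

```python
# A minimal sketch of the encoder -> expansion -> fusion -> autoregressive decoder pipeline.
# Layer choices (GRU decoder, Conv1d fusion, sizes) are illustrative assumptions.
import torch
import torch.nn as nn

class TinySynthesizer(nn.Module):
    def __init__(self, num_phonemes=100, dim=128, n_mels=80):
        super().__init__()
        self.n_mels = n_mels
        self.encoder = nn.Sequential(nn.Embedding(num_phonemes, dim), nn.Linear(dim, dim))
        self.fusion = nn.Conv1d(dim, dim, kernel_size=3, padding=1)   # stands in for the Fusion convolution
        self.decoder = nn.GRU(dim + n_mels, dim, batch_first=True)    # stands in for the autoregressive decoder
        self.to_mel = nn.Linear(dim, n_mels)

    def forward(self, phonemes, durations, spk):
        # phonemes: (1, L) LongTensor; durations: (L,) LongTensor; spk: (1, dim) pronunciation feature
        first = self.encoder(phonemes)                                # first feature, (1, L, dim)
        second = torch.repeat_interleave(first, durations, dim=1)     # extension module, (1, T, dim)
        third = second + spk.unsqueeze(1)                             # fuse the pronunciation feature
        fourth = self.fusion(third.transpose(1, 2)).transpose(1, 2)   # optional convolution (Fusion)
        mels, prev, h = [], torch.zeros(1, 1, self.n_mels), None
        for t in range(fourth.size(1)):                               # frame-by-frame autoregression
            out, h = self.decoder(torch.cat([fourth[:, t:t + 1], prev], dim=-1), h)
            prev = self.to_mel(out)
            mels.append(prev)
        return torch.cat(mels, dim=1)                                 # Mel spectrum, (1, T, n_mels)

net = TinySynthesizer()
mel = net(torch.tensor([[4, 17, 23]]), torch.tensor([1, 2, 0]), torch.randn(1, 128))
print(mel.shape)                                                      # (1, 3, 80): 1 + 2 + 0 frames
```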
Illustratively, if the first feature is (a, b, c) and the duration information is (1, 2, 0), then a is expanded 1 time, b is expanded 2 times, and c is expanded 0 times, so the expanded second feature is (a, b, b). It is to be understood that a, b and c are used here only for convenience of describing the expansion and do not specifically limit the first feature.
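The expansion in this example can be written out directly; the helper below is only a restatement of the (a, b, c) and (1, 2, 0) case above.

```python
# A minimal sketch of the expansion: repeat each phoneme feature by its duration (0 drops it).
def expand(features, durations):
    out = []
    for f, d in zip(features, durations):
        out.extend([f] * d)
    return out

print(expand(["a", "b", "c"], [1, 2, 0]))   # ['a', 'b', 'b']
```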
Step 1506, convert the mel-frequency spectrum feature into voice through the vocoder. This step is optional.
After the Mel spectrum feature is obtained, it may be input into the vocoder to be converted into voice, where the voice is the speech of the speaker (that is, the speaker corresponding to the pronunciation feature) for the text to be processed.
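As a hedged illustration of step 1506, the sketch below converts a Mel spectrum into a waveform with Griffin-Lim via librosa. This application does not specify a particular vocoder, so this stand-in, together with the toy Mel spectrum and the chosen STFT parameters, is an assumption made only for the example; in practice a neural vocoder would typically be used.

```python
# A minimal sketch, assuming Griffin-Lim as a stand-in vocoder (not specified by this application).
import numpy as np
import librosa
import soundfile as sf

mel = np.abs(np.random.randn(80, 200))     # toy (n_mels, frames) power-scale Mel spectrum
wav = librosa.feature.inverse.mel_to_audio(mel, sr=22050, n_fft=1024, hop_length=256)
sf.write("tts_output.wav", wav, 22050)     # the speech for the text to be processed
```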
In a possible implementation manner, if the data processing device is a terminal device, the present embodiment includes the above-mentioned steps 1501 to 1506. In another possible implementation manner, if the data processing device is a server, step 1501 may specifically be receiving a text to be processed sent by the terminal device. And after step 1506, the method further comprises: and sending voice to the terminal equipment.
In addition, if only applied to the scene of finding the duration of the text phoneme, the present embodiment may include steps 1501 to 1504. If applied to a scenario of speech synthesis, the present embodiment may include steps 1501 to 1506.
In this embodiment, the duration information of the phonemes in the text to be processed is obtained, and the first feature is expanded according to the duration information to obtain the second feature; the third feature is obtained based on the second feature and the pronunciation feature, and the speech of the speaker for the text to be processed is generated based on the third feature. Because the generated speech is obtained according to the duration information and the pronunciation feature, speech synthesis of a single language/dialect or across languages/dialects can be realized, the timbre consistency of the generated speech is ensured, and the poor listening experience caused by phoneme omission or phoneme dislocation is avoided.
Example two: the terminal device and the server execute the data processing method together.
Referring to fig. 17, another embodiment of the data processing method provided in this embodiment of the present application may be executed by a data processing apparatus, or may be executed by a component (e.g., a processor, a chip, or a system-on-chip) of a data processing apparatus, where the data processing apparatus includes a terminal apparatus and a server, and the embodiment includes steps 1701 to 1707.
In step 1701, the terminal device obtains a text to be processed.
In this step, the description of the terminal device acquiring the text to be processed is similar to that in step 1501 in the embodiment shown in fig. 15, and is not described here again.
Step 1702, the terminal device obtains pronunciation characteristics.
In this step, the description of the terminal device obtaining the pronunciation feature is similar to the description in step 1503 in the embodiment shown in fig. 15, and is not described here again.
Step 1703, the terminal device sends the text to be processed and the pronunciation characteristics to the server.
And after acquiring the text to be processed and the pronunciation characteristics, the terminal equipment sends the text to be processed and the pronunciation characteristics to the server.
Step 1704, the server inputs the text to be processed and the pronunciation characteristics into the trained speech synthesis network to obtain the speech features of the speaker for the text to be processed.
After receiving the text to be processed and the pronunciation characteristics sent by the terminal equipment, the server inputs the text to be processed and the pronunciation characteristics into the trained voice synthesis network to obtain the voice characteristics (such as Mel spectrum characteristics) of the speaker for the text to be processed.
Optionally, for the description of the trained speech synthesis network, reference may be made to the description in the embodiment shown in fig. 5, and details are not repeated here. In addition, in the inference process, reference may also be made to the description of the speech synthesis network in the embodiments shown in fig. 5 to fig. 11, and details are not described here again.
Step 1705, the server converts the speech feature into voice through the vocoder.
After the server obtains the Mel spectrum characteristics, the Mel spectrum characteristics are input into the vocoder to obtain the voice of the speaker to the text to be processed.
Step 1706, the server sends the voice to the terminal device.
After obtaining the voice, the server sends the voice to the terminal device.
Step 1707, the terminal device plays the voice. This step is optional.
Optionally, after receiving the voice sent by the server, the terminal device may play the voice to the user.
In this embodiment, a prediction network is introduced in the inference stage to obtain the duration information of each phoneme in the phoneme sequence, and the speech is generated based on the duration information and the text to be processed. Because the duration information of the phonemes is obtained by the dynamic programming method, the generated speech reduces the poor listening experience caused by missing phonemes. The method can realize speech synthesis of a single language/dialect or across languages/dialects and ensure the timbre consistency of the generated speech.
The model training method and the data processing method in the embodiments of the present application are described above; the data processing device in the embodiments of the present application is described below. Referring to fig. 18, an embodiment of the data processing device (terminal device or server) in the embodiment of the present application includes:
an obtaining unit 1801, configured to obtain a first text and a first voice corresponding to the first text;
the obtaining unit 1801 is further configured to obtain a first phoneme sequence based on the first text;
an obtaining unit 1801, further configured to obtain, based on an attention mechanism, a correspondence between the first speech and the first phoneme sequence, where the correspondence is used to indicate a duration of each phoneme in the first phoneme sequence in the first speech;
a modification unit 1802, configured to modify the correspondence based on a dynamic programming method to obtain first time length information of each phoneme in the first phoneme sequence;
a training unit 1803, configured to train a first prediction network based on the first phoneme sequence and the first duration information, to obtain a trained first prediction network, where the trained first prediction network is used to predict duration information of each phoneme in the text to be processed.
In this embodiment, operations performed by each unit in the data processing apparatus are similar to those described in the embodiments shown in fig. 5 to 17, and are not described again here.
In this embodiment, the first time length information of each phoneme in the first text may be obtained based on an attention mechanism and a dynamic programming method, and since the dynamic programming may infer the unaligned phoneme in a monotonicity manner, the probability that the phoneme is estimated incorrectly (for example, the phoneme is misplaced or the phoneme is swallowed) may be reduced. The first prediction network trained by the training unit 1803 may further implement duration prediction of phonemes in the text to be processed. The method is conveniently applied to scenes such as speech synthesis and the like which need phoneme duration information.
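For illustration, the following sketch shows a dynamic-programming alignment in the spirit of monotonic alignment search: given a (phonemes x frames) attention score matrix, it finds the best monotonic path and counts the frames assigned to each phoneme as its duration. It is a reconstruction under assumptions, not the exact correction procedure of this application; the toy attention matrix is random.

```python
# A minimal MAS-style dynamic programming sketch over an attention score matrix.
import numpy as np

def monotonic_durations(score):
    """score: (num_phonemes, num_frames) attention/alignment scores, num_frames >= num_phonemes."""
    P, T = score.shape
    Q = np.full((P, T), -np.inf)
    Q[0, 0] = score[0, 0]
    for t in range(1, T):
        for p in range(P):
            stay = Q[p, t - 1]                              # frame t stays on the same phoneme
            advance = Q[p - 1, t - 1] if p > 0 else -np.inf # or moves on to the next phoneme
            Q[p, t] = score[p, t] + max(stay, advance)
    durations = np.zeros(P, dtype=int)                      # backtrack the best monotonic path
    p = P - 1
    for t in range(T - 1, 0, -1):
        durations[p] += 1
        if p > 0 and Q[p - 1, t - 1] >= Q[p, t - 1]:
            p -= 1
    durations[p] += 1                                       # frame 0 belongs to phoneme 0
    return durations

attn = np.random.rand(3, 6)                                 # toy matrix: 3 phonemes, 6 frames
d = monotonic_durations(attn)
print(d, d.sum())                                           # per-phoneme frame counts; sums to 6
```

Because the path is forced to be monotonic and to cover every phoneme from the first to the last, no phoneme is skipped, which is the property the description above relies on to reduce missing or misplaced phonemes.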
Referring to fig. 19, another embodiment of a data processing device (terminal device or server) in the embodiment of the present application includes:
an obtaining unit 1901, configured to obtain a text to be processed;
the obtaining unit 1901 is further configured to obtain a phoneme sequence based on the text to be processed;
an obtaining unit 1901, further configured to obtain a pronunciation feature, where the pronunciation feature is used to describe a timbre feature of a speaker;
a prediction unit 1902, configured to predict duration information of each phoneme in the phoneme sequence based on a trained prediction network, where the trained prediction network is obtained by training the prediction network with a first text as an input of the prediction network and a value of a loss function smaller than a threshold as a target, where the loss function is used to represent a difference between duration information output by the prediction network and the first duration information, the first duration information is obtained by modifying a corresponding relationship between the first text and a first speech through a dynamic programming method, and the first speech is a speech of the first text;
optionally, the data processing apparatus may further include: a processing unit 1903, configured to input the phoneme sequence and the duration information into the trained speech synthesis network, so as to obtain a mel-frequency spectrum feature;
optionally, the data processing apparatus may further include: a conversion unit 1904, configured to convert the mel spectrum feature into speech through a vocoder, where the speech is speech of a text to be processed.
In this embodiment, operations performed by each unit in the data processing apparatus are similar to those described in the embodiments shown in fig. 5 to 17, and are not described again here.
In this embodiment, the prediction unit 1902 uses the trained prediction network in the inference stage to obtain the duration information of each phoneme in the phoneme sequence. Because the duration information is obtained by modifying the correspondence between the first text and the first speech based on the dynamic programming method, the problem of inaccurate duration prediction caused by missing or misplaced phonemes can be avoided.
Referring to fig. 20, a schematic diagram of another data processing apparatus is provided. The data processing apparatus may include a processor 2001, a memory 2002 and a communication interface 2003. The processor 2001, memory 2002 and communication interface 2003 are interconnected by wires. The memory 2002 has stored therein program instructions and data.
The memory 2002 stores program instructions and data corresponding to the steps performed by the device in the corresponding embodiments shown in fig. 5-17.
A processor 2001 for performing the steps performed by the apparatus as shown in any one of the embodiments shown in fig. 5 to 17.
The communication interface 2003 may be used for receiving and sending data for performing the steps related to the acquisition, sending and receiving in any of the embodiments shown in fig. 5 to 17.
In one implementation, the data processing device may include more or fewer components than those shown in FIG. 20, which are merely illustrative and not limiting.
Referring to fig. 21, the embodiment of the present application provides another data processing apparatus, and for convenience of description, only the portions related to the embodiment of the present application are shown, and details of the specific technology are not disclosed, please refer to the method portion of the embodiment of the present application. The data processing device may be any data processing device including a mobile phone, a tablet computer, a Personal Digital Assistant (PDA), a point of sale (POS), a vehicle-mounted computer, etc., taking the data processing device as the mobile phone as an example:
fig. 21 is a block diagram illustrating a partial structure of a mobile phone related to a data processing device provided in an embodiment of the present application. Referring to fig. 21, the cellular phone includes: radio Frequency (RF) circuit 2110, memory 2120, input unit 2130, display unit 2140, sensor 2150, audio circuit 2160, wireless fidelity (WiFi) module 2170, processor 2180, and power source 2190. Those skilled in the art will appreciate that the handset configuration shown in fig. 21 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The following describes each component of the mobile phone in detail with reference to fig. 21:
the RF circuit 2110 may be used for receiving and transmitting signals during information transmission and reception or during a call, and particularly, receives downlink information of a base station and then processes the received downlink information to the processor 2180; in addition, the data for designing uplink is transmitted to the base station. In general, the RF circuit 2110 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 2110 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to global system for mobile communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Message Service (SMS), etc.
The memory 2120 may be used for storing software programs and modules, and the processor 2180 executes various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 2120. The memory 2120 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. Additionally, the memory 2120 can include high-speed random access memory, and can also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
The input unit 2130 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the cellular phone. Specifically, the input unit 2130 may include a touch panel 2131 and other input devices 2132. The touch panel 2131, also referred to as a touch screen, can collect touch operations performed by a user on or near the touch panel 2131 (e.g., operations performed by the user on or near the touch panel 2131 using any suitable object or accessory such as a finger or a stylus), and drive a corresponding connection device according to a preset program. Alternatively, the touch panel 2131 may include two parts, namely, a touch detection device and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 2180, and can receive and execute commands sent by the processor 2180. In addition, the touch panel 2131 can be implemented by various types, such as a resistive type, a capacitive type, an infrared ray, and a surface acoustic wave. The input unit 2130 may include other input devices 2132 in addition to the touch panel 2131. In particular, other input devices 2132 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 2140 may be used to display information input by the user or information provided to the user, and various menus of the cellular phone. The display unit 2140 may include a display panel 2141, and optionally, the display panel 2141 may be configured in the form of a Liquid Crystal Display (LCD), an organic light-emitting diode (OLED), or the like. Further, the touch panel 2131 can cover the display panel 2141, and when the touch panel 2131 detects a touch operation on or near the touch panel 2131, the touch operation is transmitted to the processor 2180 to determine the type of the touch event, and then the processor 2180 provides a corresponding visual output on the display panel 2141 according to the type of the touch event. Although the touch panel 2131 and the display panel 2141 are shown as two separate components in fig. 21 to implement the input and output functions of the mobile phone, in some embodiments, the touch panel 2131 and the display panel 2141 may be integrated to implement the input and output functions of the mobile phone.
The handset may also include at least one sensor 2150, such as light sensors, motion sensors, and other sensors. Specifically, the light sensor may include an ambient light sensor that adjusts the brightness of the display panel 2141 according to the brightness of ambient light, and a proximity sensor that turns off the display panel 2141 and/or the backlight when the mobile phone is moved to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications of recognizing the posture of a mobile phone (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured on the mobile phone, further description is omitted here.
Audio circuitry 2160, speaker 2161, and microphone 2162 may provide an audio interface between a user and the mobile phone. The audio circuit 2160 can transmit the electrical signal converted from received audio data to the speaker 2161, and the speaker 2161 converts the electrical signal into a sound signal for output; on the other hand, the microphone 2162 converts a collected sound signal into an electrical signal, which is received by the audio circuit 2160 and converted into audio data; the audio data is then output to the processor 2180 for processing and transmitted to, for example, another mobile phone via the RF circuit 2110, or output to the memory 2120 for further processing.
WiFi belongs to short-distance wireless transmission technology, and the mobile phone can help a user to receive and send emails, browse webpages, access streaming media and the like through the WiFi module 2170, and provides wireless broadband internet access for the user. While fig. 21 shows WiFi module 2170, it is to be understood that it does not belong to the essential component of the handset.
The processor 2180 is a control center of the mobile phone, connects various parts of the whole mobile phone by using various interfaces and lines, and performs various functions of the mobile phone and processes data by operating or executing software programs and/or modules stored in the memory 2120 and calling data stored in the memory 2120, thereby integrally monitoring the mobile phone. Optionally, the processor 2180 may include one or more processing units; preferably, the processor 2180 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It is to be appreciated that the modem processor described above may not be integrated into processor 2180.
The phone also includes a power source 2190 (e.g., a battery) for powering the various components, and preferably, the power source may be logically connected to the processor 2180 via a power management system, so that the power management system may be used to manage charging, discharging, and power consumption.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which are not described herein.
In this embodiment, the processor 2180 included in the data processing apparatus may execute the functions in the embodiments shown in fig. 5 to 17, which are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated units described above may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
When the integrated unit is implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the invention to occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via wire (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and are merely descriptive of the various embodiments of the application and how objects of the same nature can be distinguished. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

Claims (24)

1. A method of network training, the method comprising:
acquiring a first text and a first voice corresponding to the first text;
obtaining a first phoneme sequence based on the first text;
acquiring a corresponding relation between the first voice and the first phoneme sequence based on an attention mechanism, wherein the corresponding relation is used for representing the duration of each phoneme in the first phoneme sequence in the first voice;
modifying the corresponding relation based on a dynamic programming method to obtain first time length information of each phoneme in the first phoneme sequence;
and training a first prediction network based on the first phoneme sequence and the first duration information to obtain a trained first prediction network, wherein the trained first prediction network is used for predicting duration information of each phoneme in the text to be processed.
2. The method of claim 1, wherein training a first prediction network based on the first phoneme sequence and the first time length information comprises:
and training the first prediction network by taking the first phoneme sequence as the input of the first prediction network and taking the value of a first loss function smaller than a first threshold value as a target to obtain the trained first prediction network, wherein the first loss function is used for representing the difference between the time length information output by the first prediction network and the first time length information.
3. The method according to claim 1 or 2, wherein the first speech comprises speech of at least two languages/dialects; the method further comprises the following steps:
acquiring a second text and a second voice corresponding to the second text, wherein the second voice comprises a voice of one language/dialect in the at least two languages/dialects;
acquiring a second phoneme sequence of the second text;
acquiring second duration information of each phoneme in the second phoneme sequence;
and training the second prediction network by taking the second phoneme sequence as the input of the second prediction network and taking the value of a second loss function smaller than a second threshold value as a target to obtain the first prediction network, wherein the second loss function is used for representing the difference between the duration information output by the second prediction network and the second duration information.
4. The method according to any one of claims 1 to 3, further comprising:
acquiring a first Mel spectral feature of the first voice;
acquiring a first pronunciation characteristic, wherein the first pronunciation characteristic is used for describing a tone color characteristic of the first voice;
and training the first speech synthesis network by taking the first phoneme sequence, the first time length information and the first pronunciation feature as input of the first speech synthesis network and taking a value of a third loss function smaller than a third threshold value as a target to obtain a trained first speech synthesis network and a trained first pronunciation feature, wherein the third loss function is used for representing a difference between a second Mel spectral feature output by the first speech synthesis network and the first Mel spectral feature, and the second Mel spectral feature is obtained after the first time length information is expanded.
5. The method of claim 4, wherein the first speech synthesis network comprises an encoder and an autoregressive decoder;
the training of the first speech synthesis network with the first phoneme sequence and the first time length information as the input of the first speech synthesis network and the value of the third loss function smaller than the third threshold as the target to obtain the trained first speech synthesis network includes:
acquiring a first feature corresponding to the first phoneme sequence based on the encoder;
expanding the first characteristic based on the first time length information to obtain a second characteristic;
obtaining the second mel-frequency spectrum characteristic based on an autoregressive decoder and the second characteristic;
and training the encoder and the autoregressive decoder by taking the value of the third loss function smaller than the third threshold value as a target to obtain the trained first speech synthesis network.
6. The method of claim 5, wherein obtaining the second Mel spectral feature based on the autoregressive decoder and the second feature comprises:
and inputting the second characteristic into the autoregressive decoder to obtain the second Mel spectral characteristic.
7. The method of claim 5, wherein obtaining the second Mel spectral feature based on the autoregressive decoder and the second feature comprises:
performing convolution processing on the second features to obtain third features;
and inputting the third feature into the autoregressive decoder to obtain the second Mel spectral feature.
8. The method according to any one of claims 4 to 7, wherein the first speech comprises speech of at least two languages/dialects;
the method further comprises the following steps:
acquiring a third text and a third voice corresponding to the third text, wherein the third voice comprises a voice of one language/dialect in at least two languages/dialects;
acquiring a third phoneme sequence of the third text;
acquiring third duration information of each phoneme in the third phoneme sequence;
acquiring a second pronunciation characteristic, wherein the second pronunciation characteristic is used for describing the tone color characteristic of the third voice;
acquiring a third Mel spectral feature of the third voice;
and training the second speech synthesis network by taking the third phoneme sequence, the third duration information and the second pronunciation feature as input of a second speech synthesis network and taking a value of a fourth loss function smaller than a fourth threshold as a target to obtain the first speech synthesis network and the trained second pronunciation feature, wherein the fourth loss function is used for representing a difference between a fourth Mel spectral feature output by the second speech synthesis network and the third Mel spectral feature, and the fourth Mel spectral feature is obtained after the third duration information is expanded.
9. The method according to any of claims 1 to 8, characterized in that the dynamic programming method comprises a Monotonic Alignment Search (MAS) method or a Needleman-Wunsch algorithm.
10. A method of data processing, the method comprising:
acquiring a text to be processed;
obtaining a phoneme sequence of the text to be processed based on the text to be processed;
predicting duration information of each phoneme in the phoneme sequence based on a trained prediction network, wherein the trained prediction network is obtained by training based on a first text and first duration information, the first duration information is obtained by modifying a corresponding relation between the first text and a first voice through a dynamic programming method, the first voice is a voice of the first text, and the corresponding relation is used for representing the duration of each phoneme in the phoneme sequence in the first voice.
11. The method of claim 10, further comprising:
inputting the phoneme sequence and the duration information into a trained voice synthesis network to obtain Mel spectrum characteristics, wherein the trained voice synthesis network is used for generating voices corresponding to texts;
and converting the Mel spectrum feature into voice through a vocoder, wherein the voice is the voice of the text to be processed.
12. The method of claim 11, further comprising:
acquiring pronunciation characteristics, wherein the pronunciation characteristics are used for describing tone characteristics of a speaker;
inputting the phoneme sequence and the duration information into a trained speech synthesis network to obtain a Mel spectrum characteristic, comprising:
and inputting the phoneme sequence, the duration information and the pronunciation characteristics into the trained speech synthesis network to obtain the Mel spectrum characteristics.
13. The method of claim 12, wherein the trained speech synthesis network comprises an encoder and an autoregressive decoder;
the inputting the phoneme sequence, the duration information and the pronunciation characteristics into a trained speech synthesis network includes:
acquiring the first feature corresponding to the phoneme sequence based on the encoder;
expanding the first characteristic based on the duration information to obtain a second characteristic;
acquiring a third feature based on the second feature and the pronunciation feature;
and obtaining the Mel spectral characteristics based on an autoregressive decoder and the third characteristics.
14. The method of claim 13, wherein obtaining the mel-spectrum feature based on the autoregressive decoder and the third feature comprises:
inputting the third feature into the autoregressive decoder to obtain the Mel spectrum feature.
15. The method of claim 13, wherein obtaining the mel-spectrum feature based on the autoregressive decoder and the third feature comprises:
performing convolution processing on the third features to obtain fourth features;
inputting the fourth feature into the autoregressive decoder to obtain the Mel spectrum feature.
16. The method according to any one of claims 10 to 15, wherein the text to be processed includes text of at least two languages/dialects, and the first feature is further used for describing the language/dialect to which the phoneme belongs.
17. The method according to any of claims 10 to 16, characterized in that the dynamic programming method comprises a monotonic alignment search (MAS) method or a Needleman-Wunsch algorithm.
18. A data processing apparatus, characterized in that the data processing apparatus comprises:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a first text and a first voice corresponding to the first text;
the obtaining unit is further configured to obtain a first phoneme sequence based on the first text;
the obtaining unit is further configured to obtain, based on an attention mechanism, a correspondence between the first speech and the first phoneme sequence, where the correspondence is used to indicate a duration of each phoneme in the first phoneme sequence in the first speech;
a correcting unit, configured to correct the correspondence based on a dynamic programming method to obtain first time length information of each phoneme in the first phoneme sequence;
and the training unit is used for training a first prediction network based on the first phoneme sequence and the first duration information to obtain a trained first prediction network, and the trained first prediction network is used for predicting duration information of each phoneme in the text to be processed.
19. The apparatus according to claim 18, wherein the training unit is specifically configured to use the first phoneme sequence as an input of the first prediction network, and train the first prediction network with a first loss function having a value smaller than a first threshold as a target to obtain the trained first prediction network, where the first loss function is used to represent a difference between duration information output by the first prediction network and the first duration information.
20. A data processing apparatus, characterized in that the data processing apparatus comprises:
the acquisition unit is used for acquiring a text to be processed;
the acquisition unit is further configured to obtain a phoneme sequence based on the text to be processed;
the acquisition unit is further used for acquiring pronunciation characteristics, and the pronunciation characteristics are used for describing the tone characteristics of a speaker;
the prediction unit is configured to predict duration information of each phoneme in the phoneme sequence based on a trained prediction network, where the trained prediction network is obtained by training based on a first text and first duration information, the first duration information is obtained by modifying a correspondence between the first text and a first speech through a dynamic programming method, the first speech is a speech of the first text, and the correspondence is used to indicate a duration of each phoneme in the phoneme sequence in the first speech.
21. The apparatus of claim 20, wherein the data processing apparatus further comprises:
the processing unit is used for inputting the phoneme sequence and the duration information into a trained speech synthesis network to obtain a Mel spectrum characteristic;
and the conversion unit is used for converting the Mel spectrum characteristics into voice through a vocoder, wherein the voice is the voice of the speaker to the text to be processed.
22. A data processing apparatus, characterized by comprising: a processor coupled with a memory for storing a program or instructions that, when executed by the processor, cause the data processing apparatus to perform the method of any of claims 1 to 9 or cause the data processing apparatus to perform the method of any of claims 10 to 17.
23. A computer-readable storage medium having stored therein instructions which, when executed on a computer, cause the computer to perform the method of any one of claims 1 to 9 or cause the computer to perform the method of any one of claims 10 to 17.
24. A computer program product, characterized in that the computer program product, when executed on a computer, causes the computer to perform the method of any one of claims 1 to 9 or causes the computer to perform the method of any one of claims 10 to 17.
CN202111058068.1A 2021-09-09 2021-09-09 Network training method, data processing method and related equipment Pending CN113948060A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111058068.1A CN113948060A (en) 2021-09-09 2021-09-09 Network training method, data processing method and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111058068.1A CN113948060A (en) 2021-09-09 2021-09-09 Network training method, data processing method and related equipment

Publications (1)

Publication Number Publication Date
CN113948060A true CN113948060A (en) 2022-01-18

Family

ID=79328029

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111058068.1A Pending CN113948060A (en) 2021-09-09 2021-09-09 Network training method, data processing method and related equipment

Country Status (1)

Country Link
CN (1) CN113948060A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114566143A (en) * 2022-03-31 2022-05-31 北京帝派智能科技有限公司 Speech synthesis method and speech synthesis system capable of locally modifying content

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination