WO2023116243A1 - Data conversion method and computer storage medium

Data conversion method and computer storage medium

Info

Publication number
WO2023116243A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
text
sample
prosodic
hidden
Application number
PCT/CN2022/130735
Other languages
French (fr)
Chinese (zh)
Inventor
任意
雷鸣
黄智颖
张仕良
陈谦
鄢志杰
Original Assignee
Alibaba Damo Academy (Hangzhou) Technology Co., Ltd. (阿里巴巴达摩院(杭州)科技有限公司)
Application filed by Alibaba Damo Academy (Hangzhou) Technology Co., Ltd. (阿里巴巴达摩院(杭州)科技有限公司)
Publication of WO2023116243A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Definitions

  • the embodiments of the present application relate to the field of computer technologies, and in particular, to a data conversion method and a computer storage medium.
  • Speech synthesis technology, also known as Text-to-Speech (TTS) technology, can convert text information into standard, fluent speech; it is equivalent to installing an artificial mouth on a machine.
  • In some scenarios, highly expressive speech synthesis is required.
  • This kind of speech synthesis needs to model prosody, and the expressiveness of the synthesized speech is improved through the prosody model.
  • Prosodic components include fundamental frequency, energy, and duration.
  • Existing prosody modeling is usually constructed based on the fundamental frequency features of the prosody. On the one hand, because fundamental frequency extraction is inaccurate, the prosody modeling effect is poor, which in turn leads to inaccurate prosody information; on the other hand, failure to take into account the correlation between the factors affecting prosody also results in poor prosody modeling and inaccurate prosody information.
  • Therefore, the prosody information obtained based on the current prosody modeling method has the problem of poor accuracy.
  • an embodiment of the present application provides a data conversion solution to at least partially solve the above problem.
  • According to a first aspect of the embodiments of the present application, a data conversion method is provided.
  • According to a second aspect of the embodiments of the present application, a data conversion method is provided in which a response to a user instruction is acquired, the response including text to be replied to the user instruction.
  • According to a third aspect of the embodiments of the present application, a data conversion method is provided for converting the live script text corresponding to an object to be broadcast live into live speech.
  • According to a fourth aspect of the embodiments of the present application, a data conversion method is provided in which the script text to be played includes one of the following: a line script corresponding to audio or video, or the text content of an e-book.
  • According to a fifth aspect of the embodiments of the present application, a data conversion method is provided in which a splicing vector is decoded by the decoding network of the prosody model to obtain the speech spectrum information corresponding to the text to be converted.
  • a data conversion device including:
  • an obtaining module, used to obtain the phoneme vector and text vector corresponding to the text to be converted, and the voiceprint feature vector of the target human voice;
  • the obtaining module is also used to obtain a linguistic feature vector corresponding to the text to be converted according to the phoneme vector and the text vector;
  • a prediction module configured to predict and obtain the hidden prosody vector of the text to be converted according to the text vector and the voiceprint feature vector;
  • a generating module configured to generate speech spectrum information corresponding to the text to be converted according to the linguistic feature vector, the hidden prosodic vector, and the voiceprint feature vector.
  • a data conversion device including:
  • An acquisition module configured to acquire a response to a user instruction sent to the smart device, where the response contains text to be replied to the user instruction;
  • the obtaining module is also used to obtain the phoneme vector, text vector and voiceprint feature vector of the target human voice corresponding to the text to be replied;
  • the obtaining module is also used to obtain a linguistic feature vector corresponding to the text to be replied according to the phoneme vector and the text vector;
  • a prediction module configured to predict and obtain the hidden prosody vector of the text to be replied according to the text vector and the voiceprint feature vector;
  • a generation module configured to generate speech spectrum information corresponding to the text to be replied according to the linguistic feature vector, the hidden prosody vector and the voiceprint feature vector;
  • a processing module configured to generate and play the voice corresponding to the text to be replied according to the voice spectrum information.
  • a data conversion device including:
  • An acquisition module configured to acquire the live script text corresponding to the object to be broadcasted
  • the acquiring module is also used to acquire the phoneme vector, text vector and voiceprint feature vector of the target human voice corresponding to the live script text;
  • the obtaining module is also used to obtain a linguistic feature vector corresponding to the live script text according to the phoneme vector and the text vector;
  • a prediction module configured to predict and obtain the hidden prosody vector of the live script text according to the text vector and the voiceprint feature vector;
  • a generating module configured to generate voice spectrum information corresponding to the live script text according to the linguistic feature vector, the hidden prosody vector and the voiceprint feature vector;
  • a processing module configured to generate live speech corresponding to the live script text according to the speech spectrum information.
  • a data conversion device including:
  • the obtaining module is used to obtain the script text to be played, wherein the script text to be played includes one of the following: a line script corresponding to audio or video, or the text content of an e-book;
  • the acquisition module is also used to acquire the phoneme vector, text vector and voiceprint feature vector of the target human voice corresponding to the script text;
  • the obtaining module is also used to obtain a linguistic feature vector corresponding to the script text according to the phoneme vector and the text vector;
  • a prediction module configured to predict and obtain the hidden prosody vector of the script text according to the text vector and the voiceprint feature vector
  • a generating module configured to generate speech spectrum information corresponding to the script text according to the linguistic feature vector, the hidden prosody vector and the voiceprint feature vector;
  • a processing module configured to generate a performance voice corresponding to the script text according to the voice spectrum information.
  • a data conversion device including:
  • An acquisition module configured to acquire a phoneme vector corresponding to the text to be converted through the phoneme encoding network of the prosodic model; and acquire a text vector corresponding to the text to be converted through the text encoding network of the prosodic model;
  • the prediction module is used to predict and obtain the hidden prosodic vector of the text to be converted through the hidden prosodic vector prediction network of the prosody model according to the text vector corresponding to the text to be converted and the acquired voiceprint feature vector of the target human voice;
  • the acquisition module is also used to add the phoneme vector and the text vector through the vector splicing layer of the prosody model to obtain the linguistic feature vector corresponding to the text to be converted;
  • a generating module configured to splice the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector to generate a splicing vector;
  • a processing module configured to decode the concatenation vector through the decoding network of the prosody model to obtain speech spectrum information corresponding to the text to be converted.
  • an electronic device including:
  • a processor configured to execute the program stored in the memory, and when the program is executed, the processor is configured to execute the data conversion method according to any one of the first aspect to the fifth aspect.
  • a computer storage medium on which a computer program is stored; when the program is executed by a processor, the data conversion method described in any one of the first aspect to the fifth aspect is implemented.
  • a computer program product, including a computer program; when the computer program is executed by a processor, it implements the data conversion method described in any one of the first to fifth aspects above.
  • With the solution provided by the embodiments of the present application, the phonemes of the text to be converted, the text itself, and the voiceprint features of the target human voice are all taken into consideration.
  • The linguistic features of the text to be converted can be obtained based on the phonemes and the text, and these carry the pronunciation features of the corresponding level of the text (such as character level, word level, sentence level, etc.); based on the text and the voiceprint features, the hidden prosodic vector of the text to be converted, which mainly contains prosody information, can be predicted. Prosody obtained in this way is derived from the features corresponding to the text, and pays more attention to the characteristics of the prosody itself.
  • Therefore, the speech spectrum information obtained after processing based on the linguistic features, the hidden prosodic vectors and the voiceprint features better matches the speech characteristics of the target human voice corresponding to the actual voiceprint features, and is closer to the actual prosody of the target human voice. As a result, the speech subsequently generated based on the obtained speech spectrum information is also closer to the actual human voice.
  • On the one hand, the prosody modeling is no longer based on the fundamental frequency; instead, the prosody information is extracted based on various kinds of information related to the prosody, which makes the extracted prosody more accurate. On the other hand, the comprehensive consideration of the relationship among the various factors affecting the prosody also makes the prosody thus obtained more accurate.
  • FIG. 1 is a schematic diagram of an exemplary system applicable to the data conversion method of the embodiment of the present application
  • FIG. 2A is a flowchart of steps of a data conversion method according to Embodiment 1 of the present application.
  • Fig. 2B is a schematic diagram of a model example in the embodiment shown in Fig. 2A;
  • FIG. 2C is a schematic diagram of a scenario example in the embodiment shown in FIG. 2A;
  • FIG. 3A is a flowchart of steps of a data conversion method according to Embodiment 2 of the present application.
  • Fig. 3B is a schematic diagram of a model and an example of its training process in the embodiment shown in Fig. 3A;
  • FIG. 4 is a schematic structural diagram of an electronic device according to Embodiment 3 of the present application.
  • Fig. 1 shows an exemplary system applicable to the data conversion method of the embodiment of the present application.
  • the system 100 may include a server 102, a communication network 104, and/or one or more user devices 106, exemplified in FIG. 1 as a plurality of user devices.
  • Server 102 may be any suitable server for storing information, data, programs, and/or any other suitable type of content.
  • server 102 may perform any suitable function.
  • the server 102 may be used to determine the speech spectrum information to be used in the speech synthesis process.
  • the server 102 may be used to determine the corresponding speech spectrum information based on the text to be converted, and then perform speech synthesis based on the speech spectrum information.
  • the server 102 may determine the corresponding voice spectrum information based on the phoneme corresponding to the text to be converted, the text, and the voiceprint of the target human voice.
  • communication network 104 may be any suitable combination of one or more wired and/or wireless networks.
  • communication network 104 can include any one or more of the following: the Internet, an intranet, a Wide Area Network (WAN), a Local Area Network (LAN), a wireless network, a Digital Subscriber Line (DSL) network, a frame relay network, an Asynchronous Transfer Mode (ATM) network, a Virtual Private Network (VPN) and/or any other suitable communication network.
  • User device 106 can be connected to communication network 104 via one or more communication links (e.g., communication link 112), through which it can be linked to server 102.
  • the communication link may be any communication link suitable for transferring data between the user device 106 and the server 102, such as a network link, a dial-up link, a wireless link, a hardwired link, any other suitable communication link or any suitable combination of such links.
  • the user equipment 106 may include any one or more user equipment suitable for presenting an interface for information input and output, and for playing voice.
  • user equipment 106 may comprise any suitable type of equipment.
  • user devices 106 may include IoT devices, mobile devices, tablet computers, laptop computers, desktop computers, wearable computers, game consoles, media players, vehicle entertainment systems, and/or any other appropriate type of user equipment. Note that, in some embodiments, if the user equipment 106 has sufficiently high software and hardware performance, it can also take over the function of the server 102.
  • Although server 102 is illustrated as one device, in some embodiments any suitable number of devices may be used to perform the functions performed by server 102. For example, in some embodiments, multiple devices may be used to implement the functions performed by server 102. Alternatively, the functions of the server 102 may be implemented using cloud services.
  • an embodiment of the present application provides a data conversion method, which will be described below through multiple embodiments.
  • Referring to FIG. 2A, it shows a flowchart of the steps of a data conversion method according to Embodiment 1 of the present application.
  • Step S202 Obtain the phoneme vector, text vector and voiceprint feature vector of the target human voice corresponding to the text to be converted.
  • A phoneme is the smallest unit of speech divided according to the natural properties of speech. It is analyzed according to the pronunciation actions in a syllable, and one action constitutes one phoneme. For example, the syllable "a" has only one phoneme, while "ge" has two phonemes, and so on.
  • phonemes are an important consideration and conversion basis in the process of converting text to speech. In the specific conversion process, it is necessary to determine what kind of human voice to convert the text into. Therefore, the voiceprint feature needs to be used as a reference to finally generate a voice similar to the target human voice.
  • The text vector of the text to be converted is also used in the embodiments of the present application.
  • Text vectors can be at different levels, such as phoneme level, character level, word level, clause level, sentence level, etc.
  • Text vectors are highly correlated with the other vectors used to generate prosody, such as the phoneme vectors and the voiceprint feature vectors.
  • Text vectors can provide richer reference information for the subsequent generation of prosody-related vectors, including but not limited to textual information and/or semantic information.
  • In the embodiments of the present application, the text vector can be at the character level, which, among other benefits, gives a better correspondence with the phoneme vector.
  • The specific method of generating the corresponding phoneme vector and text vector based on the text to be converted, as well as the method of obtaining the voiceprint feature vector of the target human voice, can be chosen by those skilled in the art according to the actual situation, such as a neural network model or an algorithm, which is not limited in this embodiment of the present application.
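  • As a non-limiting illustration only, the acquisition in step S202 could be sketched as follows. The encoder classes, dimensions and helper names here are hypothetical assumptions made for readability, not structures required by the embodiments of the present application:

```python
import torch
import torch.nn as nn

class PhonemeEncoder(nn.Module):
    """Hypothetical phoneme encoder: maps a phoneme id sequence to phoneme vectors."""
    def __init__(self, num_phonemes: int, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(num_phonemes, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, phoneme_ids: torch.Tensor) -> torch.Tensor:
        # phoneme_ids: [T_phoneme] -> phoneme vector sequence [T_phoneme, dim]
        return self.proj(self.embed(phoneme_ids))

class WordEncoder(nn.Module):
    """Hypothetical character-level text encoder producing the character text vector."""
    def __init__(self, num_chars: int, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(num_chars, dim)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: [T_char] -> character text vector sequence [T_char, dim]
        return self.embed(char_ids)

# The voiceprint feature vector of the target human voice (H_spk) is assumed to be
# extracted in advance by any suitable speaker-embedding method; the embodiments
# do not limit how it is obtained.
```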
  • Step S204 Obtain the linguistic feature vector corresponding to the text to be converted according to the phoneme vector and the text vector; predict and obtain the hidden prosody vector of the text to be converted according to the text vector and the voiceprint feature vector.
  • On the one hand, the text vector and the phoneme vector will be combined to generate a linguistic feature vector carrying prosodic information and semantic information; on the other hand, the text vector will be combined with the voiceprint feature vector to predict hidden prosodic vectors that mainly carry prosodic information related to the text.
  • the text vectors used by the two aspects can be obtained in different ways.
  • For example, the text vector to be combined with the phoneme vector can be obtained through a character encoding network (also called a character encoder), and the text vector to be combined with the voiceprint feature vector can be obtained through a BERT model, whose full name is Bidirectional Encoder Representations from Transformers.
  • Step S206 Generate speech spectrum information corresponding to the text to be converted according to the linguistic feature vector, hidden prosody vector and voiceprint feature vector.
  • The prosodic information includes, but is not limited to, intonation, speech rate, energy, spatial information, and the like.
  • the linguistic feature vector, the hidden prosody vector and the voiceprint feature vector may be concatenated to generate a concatenated vector; the concatenated vector may be decoded to obtain speech spectrum information corresponding to the text to be converted. Because the spliced vectors carry rich information associated with prosody in the text to be converted, the speech spectrum information obtained by decoding based on the spliced vectors is also more accurate.
  • the above process can be realized by a neural network model, which is called a prosody model in this application, and an exemplary prosody model is shown in FIG. 2B .
  • The prosody model includes: a phoneme encoding network (shown as the Phoneme Encoder in the figure), a text encoding network (shown as the character-level Word Encoder in the figure), a hidden prosody vector prediction network (shown as the LPV Predictor in the figure), a vector splicing layer (shown in the figure as the dashed box where the "+" sign is located), and a decoding network (shown in the figure as the dashed box where the Decoder is located).
  • The phoneme encoding network is used to obtain the phoneme vector corresponding to the text to be converted; the text encoding network is used to obtain the text vector corresponding to the text to be converted; the hidden prosody vector prediction network is used to predict the hidden prosody vector of the text to be converted according to the text vector corresponding to the text to be converted and the voiceprint feature vector of the target human voice;
  • the vector splicing layer is used to add the phoneme vector and the text vector to obtain the linguistic feature vector corresponding to the text to be converted; and, the linguistic feature vector , the hidden prosody vector and the voiceprint feature vector are spliced to generate a spliced vector;
  • the decoding network is used to decode the spliced vector to obtain the speech spectrum information corresponding to the text to be converted.
  • Based on this, the solution of the embodiment of the present application can be implemented as follows: the phoneme vector corresponding to the text to be converted is obtained through the phoneme encoding network of the prosody model, and the text vector corresponding to the text to be converted is obtained through the text encoding network of the prosody model; through the hidden prosody vector prediction network of the prosody model, the hidden prosody vector of the text to be converted is predicted according to the text vector corresponding to the text to be converted and the acquired voiceprint feature vector of the target human voice; through the vector splicing layer of the prosody model, the phoneme vector and the text vector are added to obtain the linguistic feature vector corresponding to the text to be converted, and the linguistic feature vector, hidden prosody vector and voiceprint feature vector are spliced to generate a splicing vector; and through the decoding network of the prosody model, the splicing vector is decoded to obtain the speech spectrum information corresponding to the text to be converted.
  • the decoding network part of the prosody model in this example is also provided with a Length Regulator and a Linear Layer.
  • Length Regulator is used to adjust the lengths of linguistic feature vectors, hidden prosody vectors and voiceprint feature vectors so that their lengths are consistent with the voice spectrum information.
  • the Linear Layer is used to linearize the output of the Decoder.
  • Text encoding network includes character encoding network and context encoding network.
  • the character encoding network as shown in the Word Encoder in the figure, is used to encode the text to be converted at the character level, and generate a character text vector for summing up with the phoneme vector.
  • the context encoding network can be such as BERT network or other network that can generate text vectors, which is used to encode the text to be converted at the character level, and generate character text vectors for inputting into the hidden prosody vector prediction network together with the voiceprint feature vector.
  • the two encoding networks may also adopt the same structure, which is also applicable to the solution of the embodiment of the present application.
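  • To make the data flow of the prosody model concrete, the following is a minimal sketch of the inference path described above, assuming a PyTorch-style interface. The function signatures, the expand_to_phoneme_level helper and the tensor shapes are illustrative assumptions; only the overall flow (encoding, LPV prediction, addition, splicing, length regulation and decoding) follows the description:

```python
import torch

def prosody_model_infer(phoneme_ids, char_ids, h_spk,
                        phoneme_encoder, word_encoder, context_encoder,
                        lpv_predictor, length_regulator, decoder, linear_layer,
                        expand_to_phoneme_level):
    # h_spk: voiceprint feature vector of the target human voice, assumed shape [D].
    # Phoneme Encoder / character-level Word Encoder.
    phoneme_emb = phoneme_encoder(phoneme_ids)                 # [T_phoneme, D]
    word_emb = word_encoder(char_ids)                          # [T_char, D]

    # Linguistic feature vector: phoneme vector plus character text vector
    # (character vectors are assumed here to be expanded to phoneme length first).
    h_ling = phoneme_emb + expand_to_phoneme_level(word_emb, phoneme_ids)

    # Hidden prosody vector (LPV) predicted from the context-encoded text and H_spk.
    context_emb = context_encoder(char_ids)                    # e.g. a BERT-style encoder
    lpv = lpv_predictor(context_emb, h_spk)                    # [T_char, D]

    # Splice H_ling, LPV and H_spk, regulate length to frame level, then decode.
    spliced = torch.cat([h_ling,
                         expand_to_phoneme_level(lpv, phoneme_ids),
                         h_spk.expand(h_ling.size(0), -1)], dim=-1)
    frames = length_regulator(spliced)                          # align with spectrum length
    return linear_layer(decoder(frames))                        # predicted mel-spectrogram
```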
  • the speech synthesis process usually includes three parts: front-end processing, acoustic model processing, and vocoder processing.
  • The front-end processing is mainly to obtain pronunciation and linguistic information from the text to be converted, including but not limited to: text normalization (text standardization), grapheme-to-phoneme conversion (such as converting text characters into phonemes and other pronunciation information, so that the subsequent acoustic model can accurately obtain the pronunciation of the text characters), and so on.
  • The acoustic model processing part is mainly completed by the acoustic model. In the embodiment of the present application, the acoustic model is implemented by the above-mentioned prosody model.
  • the prosody model generates acoustic features based on the pronunciation information or linguistic information generated by the front-end processing, such as the Mel spectrogram.
  • the prosody model outputs a mel-spectrogram based on the phonemes of the text to be converted, the text at the character level, and the voiceprint features of the target human voice to be converted. The process is as described above, and will not be repeated here.
  • the mel-spectrogram output by the prosody model will be input into the vocoder, and the vocoder will synthesize the final sound waveform based on the mel-spectrogram.
  • the TTS conversion process from text to speech is completed.
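  • A compact sketch of this three-stage text-to-speech flow is given below; all helper names (the front-end functions, the prosody model object and the vocoder) are assumed for illustration, and any components fulfilling the described roles could be substituted:

```python
def text_to_speech(text, h_spk, prosody_model, vocoder,
                   normalize_text, grapheme_to_phoneme, tokenize_characters):
    # Front-end processing: text normalization and grapheme-to-phoneme conversion.
    norm_text = normalize_text(text)
    phoneme_ids = grapheme_to_phoneme(norm_text)
    char_ids = tokenize_characters(norm_text)

    # Acoustic model processing: the prosody model outputs a mel-spectrogram based on
    # the phonemes, the character-level text and the voiceprint features (h_spk).
    mel = prosody_model(phoneme_ids, char_ids, h_spk)

    # Vocoder processing: synthesize the final sound waveform from the mel-spectrogram.
    return vocoder(mel)
```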
  • In one application scenario, the speech synthesis process includes: obtaining a response to a user instruction sent to the smart device, the response containing the text to be replied to the user instruction; obtaining the phoneme vector and text vector corresponding to the text to be replied and the voiceprint feature vector of the target human voice; obtaining the linguistic feature vector corresponding to the text to be replied according to the phoneme vector and the text vector; predicting the hidden prosody vector of the text to be replied according to the text vector and the voiceprint feature vector; generating the speech spectrum information corresponding to the text to be replied according to the linguistic feature vector, the hidden prosody vector and the voiceprint feature vector; and generating and playing the speech corresponding to the text to be replied according to the speech spectrum information.
  • For example, the smart device is a smart speaker, the user instruction is a voice question issued by the user, and the text to be replied corresponds to the reply to the voice question.
  • Suppose user X asks the smart speaker a voice question: "What is 1 plus 1 equal to?"
  • The smart speaker converts the question into text and sends it to the server for query, and obtains the query result returned by the server: "1 plus 1 equals 2".
  • The smart speaker then converts each character in the query result into a phoneme to form a phoneme sequence; in addition, the smart speaker has its own voiceprint features.
  • The smart speaker uses the corresponding phonemes in the phoneme sequence, the characters, and the voiceprint features as the input of the prosody model in character order, and a Mel spectrogram is output through the above-mentioned processing of the prosody model; the Mel spectrogram is then input into the vocoder, and the final speech is synthesized and played by the vocoder. In this way, the reply to the voice question of user X is realized.
  • The prosody model and the vocoder are shown separately above, but those skilled in the art should understand that, in practical applications, the prosody model and the vocoder may both be set in the smart speaker, and their execution is controlled by corresponding components in the smart speaker, such as a processor.
  • the speech synthesis process may include: obtaining the live script text corresponding to the object to be broadcast live; obtaining the phoneme vector, text vector, and voiceprint feature vector of the target human voice corresponding to the live script text; according to the phoneme vector and Text vector, to obtain the linguistic feature vector corresponding to the live script text; predict and obtain the hidden prosodic vector of the live script text according to the text vector and voiceprint feature vector; generate the live broadcast according to the linguistic feature vector, hidden prosodic vector and voiceprint feature vector Voice spectrum information corresponding to the script text; generating live voice corresponding to the live script text according to the voice spectrum information.
  • The live script corresponding to the object to be broadcast live can be a live script corresponding to multiple live broadcast objects (such as commodities, content or programs, etc.), for example the script of the whole live broadcast, or the live script corresponding to one or some of the multiple live broadcast objects.
  • the above-mentioned method can be used to finally convert the live broadcast script into live voice, so as to be applied to live broadcast scenarios, such as live broadcast delivery or live content promotion, and so on.
  • the live broadcast voice can be adapted to a virtual anchor or a real anchor, and can be widely used in live broadcast scenarios.
  • the speech synthesis process may include: obtaining the script text to be broadcast; obtaining the phoneme vector, text vector, and voiceprint feature vector of the target vocal corresponding to the script text; obtaining the script according to the phoneme vector and the text vector The linguistic feature vector corresponding to the text; predict and obtain the hidden prosodic vector of the script text according to the text vector and the voiceprint feature vector; generate the voice spectrum information corresponding to the script text according to the linguistic feature vector, hidden prosodic vector and voiceprint feature vector; According to the voice spectrum information, the performance voice corresponding to the script text is generated.
  • the script text to be performed includes one of the following: a line script corresponding to audio or video, or the text content of an e-book.
  • the above-mentioned method can be used to finally convert the script text into a performance voice, so as to be applied to the performance scene.
  • For example, the performance voice can be used to dub video characters, to realize audio generation, or to realize audio e-books, and so on.
  • It can be seen that, through the solution of this embodiment, the phonemes of the text to be converted, the text itself, and the voiceprint features of the target human voice are all taken into consideration.
  • The linguistic features of the text to be converted can be obtained based on the phonemes and the text, and these carry the pronunciation features of the corresponding level of the text (such as character level, word level, sentence level, etc.); based on the text and the voiceprint features, the hidden prosodic vector of the text to be converted, which mainly contains prosody information, can be predicted. Prosody obtained in this way is derived from the features corresponding to the text, and pays more attention to the characteristics of the prosody itself.
  • Therefore, the speech spectrum information obtained after processing based on the linguistic features, the hidden prosodic vectors and the voiceprint features better matches the speech characteristics of the target human voice corresponding to the actual voiceprint features, and is closer to the actual prosody of the target human voice. As a result, the speech subsequently generated based on the obtained speech spectrum information is also closer to the actual human voice.
  • On the one hand, the prosody modeling is no longer based on the fundamental frequency; instead, the prosody information is extracted based on various kinds of information related to the prosody, which makes the extracted prosody more accurate. On the other hand, the comprehensive consideration of the relationship among the various factors affecting the prosody also makes the prosody thus obtained more accurate.
  • Referring to FIG. 3A, it shows a flowchart of the steps of a data conversion method according to Embodiment 2 of the present application.
  • In this embodiment, data conversion using a prosody model is taken as an example; the training process of the prosody model is first introduced, and then data conversion is performed based on the trained prosody model.
  • Step S302 Obtain training samples, and use the training samples to train the prosodic model.
  • The training sample includes a text sample to be converted, a corresponding voice sample, and a voiceprint feature sample vector.
  • In this embodiment, the voice sample is a low-frequency voice sample, such as a voice sample with a frequency band of 0-2 kHz (kilohertz).
  • On the one hand, the low-frequency speech samples carry sufficient prosody-related information, so using them will not affect the training effect; on the other hand, removing the speech in frequency bands other than the low-frequency band can make the model structure simpler.
  • the full-band voice samples are also applicable to the solutions of the embodiments of the present application.
  • In addition, low-quality speech samples containing noise can also be used, and the training is no longer limited to high-quality speech samples. In this way, audio in video, conventional audio, broadcast audio, etc. can all be used as speech samples in the embodiments of the present application, which greatly enriches the number and selection range of speech samples and reduces the acquisition cost of speech samples.
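  • For example, the low-frequency portion of a speech sample could be represented by restricting the mel filterbank to the 0-2 kHz band. The sketch below uses librosa and is only one possible way to prepare such training features; the number of mel bins is an arbitrary assumption:

```python
import librosa

def low_band_mel(wav_path: str, n_mels: int = 20, fmax: float = 2000.0):
    """Log-mel spectrogram restricted to the 0-2 kHz band of a (possibly noisy) voice sample."""
    y, sr = librosa.load(wav_path, sr=None)          # keep the file's native sample rate
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                         fmin=0.0, fmax=fmax)
    return librosa.power_to_db(mel)                  # features for the prosody encoder
```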
  • In this embodiment, the prosody model is as shown in FIG. 3B, and includes: a phoneme encoding network (shown as the Phoneme Encoder in the figure), a text encoding network, a prosody encoding network (shown as the Prosody Encoder in the figure), a hidden prosody vector prediction network (shown as the LPV Predictor in the figure), a vector splicing layer (shown in the figure as the dashed box where the "+" sign is located), and a decoding network (shown in the figure as the dashed box where the Decoder is located).
  • The training of the prosody model includes: inputting the phonemes corresponding to the text sample to be converted into the phoneme encoding network to obtain the corresponding phoneme sample vector; inputting the characters of the text sample to be converted into the text encoding network to obtain the corresponding character sample text vector; inputting the speech sample, the phoneme sample vector, the character sample text vector and the voiceprint feature sample vector into the prosody encoding network to obtain the corresponding first hidden prosody sample vector (the LPV shown in the prosody model in FIG. 3B); and training the prosody model based on the phoneme sample vector, the character sample text vector, the voiceprint feature sample vector and the first hidden prosody sample vector.
  • the text encoding network is divided into a character encoding network (shown as a character-level Word Encoder) and a context encoding network (shown as a Context Encoder in the upper right corner).
  • Correspondingly, inputting the speech sample, the phoneme sample vector, the character sample text vector and the voiceprint feature sample vector into the prosody encoding network to obtain the corresponding first hidden prosody sample vector can be realized as follows: the speech sample, the phoneme sample vector, the first character sample text vector and the voiceprint feature sample vector are input into the prosody encoding network to obtain the corresponding first hidden prosody sample vector.
  • the decoding network part is also provided with a Length Regulator and a Linear Layer.
  • the Length Regulator is used to adjust the length of the linguistic feature sample vector, the first hidden prosody sample vector and the voiceprint feature sample vector, so that their lengths are consistent with the voice spectrum information.
  • the Linear Layer is used to linearize the output of the Decoder.
  • The training for the L-shaped dashed box on the left side in FIG. 3B includes: converting the text sequence of the input text sample to be converted into a phoneme sequence (shown as Phoneme in the figure) and a character sequence (shown as Word in the figure), which are input into the phoneme encoding network Phoneme Encoder and the character encoding network Word Encoder, respectively. The phoneme sample vector Phoneme Embedding is then obtained through the Phoneme Encoder, and the first character sample text vector Word Embedding is obtained through the Word Encoder. Furthermore, Phoneme Embedding and Word Embedding are summed to obtain the linguistic feature sample vector H_ling.
  • In addition, the mel spectrum (mel-spec) of the sample human voice, that is, the low-frequency part of the voice sample (such as the 0-2 kHz part), is passed through the prosody encoding network Prosody Encoder to obtain the first hidden prosody sample vectors (latent prosody vectors, LPV).
  • H_ling, the voiceprint feature sample vector H_spk, and the first hidden prosody sample vector are then spliced together and sent to the subsequent decoding network to obtain the predicted mel spectrum.
  • The training process of the prosody encoding network Prosody Encoder can be exemplarily implemented as follows: feature extraction is performed on the speech sample through the first convolutional layer of the prosody encoding network based on the phoneme sample vector and the voiceprint feature sample vector, to obtain first prosodic sample features; character-level pooling is performed on the first prosodic sample features through the pooling layer of the prosody encoding network to obtain character-level prosodic sample features; feature extraction is performed on the character-level prosodic sample features through the second convolutional layer of the prosody encoding network based on the first character sample text vector and the voiceprint feature sample vector, to obtain second prosodic sample features; and the second prosodic sample features are vectorized through the vectorization layer of the prosody encoding network to obtain the first hidden prosody sample vector.
  • the prosodic encoding network structure is simplified, and the hidden prosodic sample vector can be extracted effectively.
  • The input of the prosody encoding network Prosody Encoder is the low-frequency part of the Mel spectrum of the voice sample corresponding to the text sample to be converted, together with Phoneme Embedding and Word Embedding (for simplicity of expression, indicated in the text as H_ling) and H_spk, and the output is the first hidden prosody sample vector sequence at the character level.
  • The prosody encoding network Prosody Encoder contains two levels of Conv Stacks (convolution stacks). When the first-level Conv Stacks processes the low-frequency part of the Mel spectrum, its input includes Phoneme Embedding and H_spk in addition to the low-frequency part of the Mel spectrum, so that the convolution processing of the low-frequency part of the Mel spectrum can filter out the influence of the phonemes on the prosody; the convolution-processed low-frequency part of the Mel spectrum is then compressed to the character level by the pooling operation of the character-level pooling layer (Word-level Pooling). The second-level Conv Stacks obtains the hidden prosody expression based on the output of the first-level Conv Stacks together with Word Embedding and H_spk, and the addition of Word Embedding allows the convolution processing to filter out the influence of character semantics on the prosody. Finally, based on this hidden prosody expression, the first hidden prosody sample vector sequence at the character level is obtained through the vector quantization layer (Vector Quantization).
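  • A minimal sketch of this two-level structure is given below, assuming frame-aligned inputs and a simple nearest-neighbour vector-quantization codebook; the layer sizes, the word_level_pool helper and the codebook size are illustrative assumptions rather than the actual network configuration:

```python
import torch
import torch.nn as nn

class ProsodyEncoder(nn.Module):
    """Sketch: low-band mel + phoneme/word embeddings + H_spk -> character-level LPVs."""
    def __init__(self, mel_dim=20, hid=256, codebook_size=128):
        super().__init__()
        # First Conv Stack: frame level, conditioned on Phoneme Embedding and H_spk.
        self.conv1 = nn.Sequential(nn.Conv1d(mel_dim + 2 * hid, hid, 3, padding=1), nn.ReLU())
        # Second Conv Stack: character level, conditioned on Word Embedding and H_spk.
        self.conv2 = nn.Sequential(nn.Conv1d(3 * hid, hid, 3, padding=1), nn.ReLU())
        self.codebook = nn.Embedding(codebook_size, hid)   # vector-quantization codebook

    def forward(self, low_mel, phoneme_emb, word_emb, h_spk, word_level_pool):
        # low_mel:     [B, T_frame, mel_dim]  low-frequency part of the mel spectrum
        # phoneme_emb: [B, T_frame, hid]      phoneme embedding, assumed expanded to frame level
        # word_emb:    [B, T_char, hid]       character text vector (Word Embedding)
        # h_spk:       [B, 1, hid]            voiceprint feature sample vector
        x = torch.cat([low_mel, phoneme_emb, h_spk.expand_as(phoneme_emb)], dim=-1)
        x = self.conv1(x.transpose(1, 2)).transpose(1, 2)   # filters out phoneme influence
        x = word_level_pool(x)                               # compress frames to character level
        x = torch.cat([x, word_emb, h_spk.expand_as(word_emb)], dim=-1)
        x = self.conv2(x.transpose(1, 2)).transpose(1, 2)    # filters out character semantics
        # Vector quantization: snap each character feature to its nearest codebook entry.
        codes = self.codebook.weight.unsqueeze(0).expand(x.size(0), -1, -1)
        idx = torch.cdist(x, codes).argmin(dim=-1)
        return self.codebook(idx)                             # first hidden prosody sample vectors
```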
  • Then, the prosody model can be trained based on the phoneme sample vector, the first character sample text vector, the voiceprint feature sample vector and the first hidden prosody sample vector. Specifically, this may include: adding the phoneme sample vector and the first character sample text vector through the vector splicing layer to obtain the linguistic feature sample vector; splicing the linguistic feature sample vector, the voiceprint feature sample vector and the first hidden prosody sample vector to obtain a spliced sample vector; and decoding the spliced sample vector through the decoding network and training the prosody model according to the decoding result.
  • Optionally, length regularization processing can also be performed on the spliced sample vector through the length regularization layer, and the decoding network then decodes the length-regularized spliced sample vector. Specifically, this may be as shown in (a) of FIG. 3B.
  • the prosody encoding network Prosody Encoder not only participates in the training of the left L-shaped dashed box in Figure 3B(a), but also undertakes the training task of the hidden prosody vector prediction network LPV Predictor.
  • In the subsequent data conversion (inference) stage, the prosody prediction will be mainly realized by the LPV Predictor, and the prosody encoding network Prosody Encoder will no longer function.
  • Therefore, the training of the prosody model also includes: inputting the second character sample text vector and the voiceprint feature sample vector into the hidden prosody vector prediction network to predict the second hidden prosody sample vector; and training the hidden prosody vector prediction network according to the difference between the first hidden prosody sample vector and the second hidden prosody sample vector.
  • the acquisition of the second character sample text vector can be realized by using the context encoding network Context Encoder in the upper right corner of Figure 3B, and its specific structure can adopt the BERT model structure.
  • However, other structures, such as any model structure trained on plain text, are also applicable to the solutions of the embodiments of the present application.
  • A simple schematic diagram of training the hidden prosody vector prediction network is shown in the lower right corner of FIG. 3B. It can be seen that the prosody encoding network Prosody Encoder outputs the first hidden prosody sample vector based on the low-frequency part of the Mel spectrum of the speech sample (that is, the noisy audio shown in the lower right corner of FIG. 3B), Phoneme Embedding and Word Embedding (for simplicity indicated in the text as H_ling) and H_spk, while the LPV Predictor outputs the second hidden prosody sample vector based on the character sequence of the text to be converted (shown as Word in the figure) and H_spk. In (d) of FIG. 3B, the two hidden prosody sample vectors are both shown as LPV. Based on the difference between the two, the LPV Predictor can be trained.
  • the loss function may be any appropriate function, including but not limited to a distance function such as a cosine distance function, which is not limited in this embodiment of the present application.
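  • As one possible choice among the appropriate distance functions mentioned above, a cosine-distance loss between the first (encoder-produced) and second (predicted) hidden prosody sample vectors could be written as follows; this is purely illustrative:

```python
import torch
import torch.nn.functional as F

def lpv_loss(lpv_first: torch.Tensor, lpv_second: torch.Tensor) -> torch.Tensor:
    # lpv_first:  character-level LPVs from the Prosody Encoder (training target)
    # lpv_second: character-level LPVs from the LPV Predictor (prediction)
    return (1.0 - F.cosine_similarity(lpv_first, lpv_second, dim=-1)).mean()
```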
  • The LPV Predictor is an autoregressive prediction model, as can be seen from (c) in FIG. 3B. On the one hand, the LPV Predictor converts the input Word into a character vector through the Context Encoder, and the character vector output by the Context Encoder is expressed as H_i; on the other hand, when the LPV Predictor processes the current character, it also uses the LPV corresponding to the previous character (LPV i-1 in the figure) as a reference. The two are spliced, and the spliced vector is subjected to subsequent processing (such as normalization and convolution), where the normalization layer can be expressed, for example, as add&norm (add denotes residual processing and norm denotes normalization processing), and the convolutional layer can be, for example, Conv1D (one-dimensional convolution); finally, the prosody prediction result for the current character, that is, LPV i, is obtained.
  • That is, the prediction process can be implemented as: inputting the second character sample text vector corresponding to the current character to be predicted and the voiceprint feature sample vector into the hidden prosody vector prediction network; performing feature fusion on them with the second hidden prosody sample vector corresponding to the previous character of the current character; and, based on the fused feature vector, predicting the second hidden prosody sample vector of the current character. More accurate prosodic information can be obtained through this autoregressive method.
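  • The autoregressive prediction loop described above might be sketched as follows. The fusion rule, the per-character stand-ins for the add&norm and Conv1D sub-layers, and the zero initial LPV are assumptions made only to keep the example short:

```python
import torch
import torch.nn as nn

class LPVPredictor(nn.Module):
    """Sketch of the autoregressive hidden-prosody-vector predictor."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.fuse = nn.Linear(3 * dim, dim)    # fuses H_i, H_spk and the previous LPV
        self.norm = nn.LayerNorm(dim)          # stands in for the add&norm sub-layer
        self.conv = nn.Conv1d(dim, dim, 1)     # stands in for the Conv1D sub-layer
        self.out = nn.Linear(dim, dim)

    def forward(self, context_emb: torch.Tensor, h_spk: torch.Tensor) -> torch.Tensor:
        # context_emb: [B, T_char, dim] character vectors H_i from the Context Encoder
        # h_spk:       [B, dim]         voiceprint feature vector
        B, T, D = context_emb.shape
        lpv_prev = torch.zeros(B, D, device=context_emb.device)   # assumed initial LPV
        lpvs = []
        for i in range(T):
            h_i = context_emb[:, i]
            fused = self.fuse(torch.cat([h_i, h_spk, lpv_prev], dim=-1))
            fused = self.norm(fused + h_i)                         # residual + normalization
            fused = self.conv(fused.unsqueeze(-1)).squeeze(-1)     # per-character convolution
            lpv_i = self.out(fused)                                # prediction for the i-th character
            lpvs.append(lpv_i)
            lpv_prev = lpv_i
        return torch.stack(lpvs, dim=1)                            # [B, T_char, dim]
```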
  • Step S304 Obtain the phoneme vector, text vector and voiceprint feature vector of the target human voice corresponding to the text to be converted.
  • In this embodiment, the phoneme sequence of the text to be converted is encoded into the phoneme vector Phoneme Embedding through the phoneme encoding network Phoneme Encoder of the prosody model, and the character sequence of the text to be converted is encoded into the character text vector Word Embedding through its character encoding network Word Encoder.
  • the voiceprint feature vector H_spk of the target voice can be obtained in advance, and the specific means of extracting the voiceprint feature vector based on the target voice is not limited in this embodiment of the present application.
  • Step S306 According to the phoneme vector and the text vector, obtain the linguistic feature vector corresponding to the text to be converted; according to the text vector and the voiceprint feature vector, predict and obtain the hidden prosodic vector of the text to be converted.
  • Step S308 Generate speech spectrum information corresponding to the text to be converted according to the linguistic feature vector, hidden prosody vector and voiceprint feature vector.
  • the corresponding speech can be output through the vocoder, realizing the conversion from text to speech.
  • In this embodiment, the hidden prosody vector is used to represent the prosody instead of individual prosody components, which avoids the problems in traditional methods of poor prosody modeling caused by inaccurate fundamental frequency extraction and by the lack of correlation in the prediction of each prosody component, problems that lead to a poor spectrum and, in turn, poor speech synthesis.
  • On the one hand, the prosody modeling is no longer based on the fundamental frequency; instead, the prosody information is extracted based on various kinds of information related to the prosody, which makes the extracted prosody more accurate. On the other hand, the comprehensive consideration of the relationship between the various factors affecting the prosody (such as the phonemes, the text, and the voiceprint of the target human voice) also makes the prosody thus obtained more accurate.
  • FIG. 4 shows a schematic structural diagram of an electronic device according to Embodiment 3 of the present application.
  • the specific embodiment of the present application does not limit the specific implementation of the electronic device.
  • the electronic device may include: a processor (processor) 402, a communication interface (Communications Interface) 404, a memory (memory) 406, and a communication bus 408.
  • the processor 402 , the communication interface 404 , and the memory 406 communicate with each other through the communication bus 408 .
  • the communication interface 404 is used for communicating with other electronic devices or servers.
  • the processor 402 is configured to execute the program 410, and specifically, may execute relevant steps in the foregoing data conversion method embodiments.
  • the program 410 may include program codes including computer operation instructions.
  • the processor 402 may be a CPU, or an ASIC (Application Specific Integrated Circuit), or one or more integrated circuits configured to implement the embodiments of the present application.
  • the one or more processors included in the smart device may be of the same type, such as one or more CPUs, or may be different types of processors, such as one or more CPUs and one or more ASICs.
  • the memory 406 is used to store the program 410 .
  • the memory 406 may include a high-speed RAM memory, and may also include a non-volatile memory, such as at least one disk memory.
  • the program 410 may be specifically configured to enable the processor 402 to perform any operation described in any one of the foregoing data conversion method embodiments.
  • each step in the program 410 refers to the corresponding descriptions in the corresponding steps and units in the above-mentioned data conversion method embodiment and related method embodiments, and have corresponding beneficial effects, so details are not repeated here.
  • the specific working process of the above-described devices and modules can refer to the corresponding process description in the foregoing method embodiments, and details are not repeated here.
  • the embodiment of the present application also provides a data conversion device, including:
  • an obtaining module, used to obtain the phoneme vector and text vector corresponding to the text to be converted, and the voiceprint feature vector of the target human voice;
  • the obtaining module is also used to obtain a linguistic feature vector corresponding to the text to be converted according to the phoneme vector and the text vector;
  • a prediction module configured to predict and obtain the hidden prosody vector of the text to be converted according to the text vector and the voiceprint feature vector;
  • a generating module configured to generate speech spectrum information corresponding to the text to be converted according to the linguistic feature vector, the hidden prosodic vector, and the voiceprint feature vector.
  • the text vector is a character text vector corresponding to each character in the text to be converted.
  • the data conversion method is performed by a prosody model
  • the prosody model at least includes: a phoneme encoding network, a text encoding network, a hidden prosody vector prediction network, a vector splicing layer, and a decoding network;
  • the phoneme encoding network is used to obtain the phoneme vector corresponding to the text to be converted;
  • the text encoding network is used to obtain a text vector corresponding to the text to be converted
  • the hidden prosody vector prediction network is used to predict and obtain the hidden prosody vector of the text to be converted according to the text vector corresponding to the text to be converted and the acquired voiceprint feature vector of the target human voice;
  • the vector splicing layer is used to add the phoneme vector and the text vector to obtain the linguistic feature vector corresponding to the text to be converted; and, for the linguistic feature vector and the hidden prosody vector Splicing with the voiceprint feature vector to generate a splicing vector;
  • the decoding network is configured to decode the splicing vector to obtain speech spectrum information corresponding to the text to be converted.
  • the text encoding network includes a character encoding network and a context encoding network
  • the character encoding network is used to perform character-level encoding on the text to be converted, and generate a character text vector for adding to the phoneme vector;
  • the context encoding network is used to perform character-level encoding on the text to be converted, and generate a character text vector for inputting the hidden prosody vector prediction network together with the voiceprint feature vector.
  • Optionally, the acquiring module is also used for acquiring a training sample, wherein the training sample includes a text sample to be converted, a corresponding voice sample, and a voiceprint feature sample vector, and the voice sample is a voice sample with a frequency band of 0-2 kHz;
  • the device also includes a processing module
  • the processing module is used to train the prosody model by using the training samples.
  • the prosodic model further includes a prosodic coding network
  • the processing module is specifically used for:
  • the prosody model is trained based on the phoneme sample vector, the character sample text vector, the voiceprint feature sample vector and the first hidden prosodic sample vector.
  • the processing module is specifically used for:
  • the characters of the text to be converted are respectively input into the character encoding network and the context encoding network to obtain corresponding first character sample text vectors and second character sample text vectors;
  • the inputting of the speech sample, the phoneme sample vector, the character sample text vector and the voiceprint feature sample vector into the prosody encoding network to obtain the corresponding first hidden prosody sample vector includes: inputting the speech sample, the phoneme sample vector, the first character sample text vector and the voiceprint feature sample vector into the prosody encoding network to obtain the corresponding first hidden prosody sample vector.
  • the processing module is specifically used for:
  • the second prosodic sample feature is vectorized by the vectorization layer of the prosodic encoding network to obtain a first hidden prosodic sample vector.
  • the processing module is specifically used for:
  • the hidden prosodic vector prediction network is trained according to the difference between the first hidden prosodic sample vector and the second hidden prosodic sample vector.
  • the embodiment of the present application also provides a data conversion device, including:
  • An acquisition module configured to acquire a response to a user instruction sent to the smart device, where the response contains text to be replied to the user instruction;
  • the obtaining module is also used to obtain the phoneme vector, text vector and voiceprint feature vector of the target human voice corresponding to the text to be replied;
  • the obtaining module is also used to obtain a linguistic feature vector corresponding to the text to be replied according to the phoneme vector and the text vector;
  • a prediction module configured to predict and obtain the hidden prosody vector of the text to be replied according to the text vector and the voiceprint feature vector;
  • a generation module configured to generate speech spectrum information corresponding to the text to be replied according to the linguistic feature vector, the hidden prosody vector and the voiceprint feature vector;
  • a processing module configured to generate and play the voice corresponding to the text to be replied according to the voice spectrum information.
  • the embodiment of the present application also provides a data conversion device, including:
  • An acquisition module configured to acquire the live script text corresponding to the object to be broadcasted
  • the acquiring module is also used to acquire the phoneme vector, text vector and voiceprint feature vector of the target human voice corresponding to the live script text;
  • the obtaining module is also used to obtain a linguistic feature vector corresponding to the live script text according to the phoneme vector and the text vector;
  • a prediction module configured to predict and obtain the hidden prosody vector of the live script text according to the text vector and the voiceprint feature vector;
  • a generating module configured to generate voice spectrum information corresponding to the live script text according to the linguistic feature vector, the hidden prosody vector and the voiceprint feature vector;
  • a processing module configured to generate live speech corresponding to the live script text according to the speech spectrum information.
  • the embodiment of the present application also provides a data conversion device, including:
  • the obtaining module is used to obtain the script text to be played, wherein the script text to be played includes one of the following: a line script corresponding to audio or video, or the text content of an e-book;
  • the acquisition module is also used to acquire the phoneme vector, text vector and voiceprint feature vector of the target human voice corresponding to the script text;
  • the obtaining module is also used to obtain a linguistic feature vector corresponding to the script text according to the phoneme vector and the text vector;
  • a prediction module configured to predict and obtain the hidden prosody vector of the script text according to the text vector and the voiceprint feature vector
  • a generating module configured to generate speech spectrum information corresponding to the script text according to the linguistic feature vector, the hidden prosody vector and the voiceprint feature vector;
  • a processing module configured to generate a performance voice corresponding to the script text according to the voice spectrum information.
  • the embodiment of the present application also provides a data conversion device, including:
  • An acquisition module configured to acquire a phoneme vector corresponding to the text to be converted through the phoneme encoding network of the prosodic model; and acquire a text vector corresponding to the text to be converted through the text encoding network of the prosodic model;
  • the prediction module is used to predict and obtain the hidden prosodic vector of the text to be converted through the hidden prosodic vector prediction network of the prosody model according to the text vector corresponding to the text to be converted and the acquired voiceprint feature vector of the target human voice;
  • the acquisition module is also used to add the phoneme vector and the text vector through the vector splicing layer of the prosody model to obtain the linguistic feature vector corresponding to the text to be converted;
  • a generating module configured to splice the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector to generate a spliced vector;
  • a processing module configured to decode the spliced vector through the decoding network of the prosody model to obtain speech spectrum information corresponding to the text to be converted.
  • An embodiment of the present application further provides a computer program product, including computer instructions, where the computer instructions instruct a computing device to perform operations corresponding to any one of the data conversion methods in the foregoing method embodiments.
  • the input of the prosodic encoding network in the embodiments of the present application is exemplified by the Mel spectrum, but is not limited thereto; other acoustic features (such as LPC features, MFCC, fbank, raw waveform, etc.) are also applicable.
  • each component/step described in the embodiments of the present application can be divided into more components/steps, and two or more components/steps or partial operations of components/steps can also be combined into new components/steps to achieve the purpose of the embodiments of the present application.
  • the above methods according to the embodiments of the present application can be implemented in hardware or firmware, or as software or computer code that can be stored in a recording medium (such as a CD-ROM, RAM, floppy disk, hard disk, or magneto-optical disk), or as computer code originally stored on a remote recording medium or a non-transitory machine-readable medium and downloaded over a network to be stored on a local recording medium, so that the methods described herein can be processed by such software stored on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware such as an ASIC or FPGA.
  • a computer, processor, microprocessor controller, or programmable hardware includes memory components (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code which, when accessed and executed by the computer, processor, or hardware, implements the data conversion methods described herein.
  • the execution of the code converts the general-purpose computer into a special-purpose computer for executing the data conversion methods described herein.

Abstract

A data conversion method and a corresponding apparatus, an electronic device, a computer storage medium, and a computer program product. The data conversion method comprises: obtaining a phoneme vector and a text vector corresponding to a text to be converted and a voiceprint feature vector of a target voice; obtaining, according to the phoneme vector and the text vector, a linguistic feature vector corresponding to said text; predicting, according to the text vector and the voiceprint feature vector, a hidden prosody vector of said text; and generating, according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector, speech spectrum information corresponding to said text. According to the method, the prosody determined for the text to be converted into speech is more accurate.

Description

Data conversion method and computer storage medium
This application claims priority to the Chinese patent application No. 202111559250.5, entitled "Data conversion method and computer storage medium", filed with the China Patent Office on December 20, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The embodiments of the present application relate to the field of computer technologies, and in particular, to a data conversion method and a computer storage medium.
Background
Speech synthesis technology, also known as text-to-speech (TTS) technology, can convert text information into standard and fluent speech, which is equivalent to installing an artificial mouth on a machine. To achieve an effect closer to a real human voice, highly expressive speech synthesis is required. Such speech synthesis needs to model prosody, and the prosody model is used to improve the expressiveness of the synthesized speech.
Generally speaking, prosodic components include fundamental frequency, energy, and duration. Existing prosody modeling is usually built on the fundamental frequency features of the prosody. On the one hand, inaccurate fundamental frequency extraction leads to poor prosody modeling, which in turn makes the prosody information obtained from it inaccurate; on the other hand, failing to consider the correlations among the factors that affect prosody also results in poor prosody modeling and inaccurate prosody information.
Therefore, the prosody information obtained with current prosody modeling approaches suffers from poor accuracy.
Summary of the Invention
In view of this, embodiments of the present application provide a data conversion solution to at least partially solve the above problem.
According to a first aspect of the embodiments of the present application, a data conversion method is provided, including:
obtaining a phoneme vector and a text vector corresponding to a text to be converted and a voiceprint feature vector of a target human voice; obtaining a linguistic feature vector corresponding to the text to be converted according to the phoneme vector and the text vector; predicting a hidden prosody vector of the text to be converted according to the text vector and the voiceprint feature vector; and generating speech spectrum information corresponding to the text to be converted according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector.
According to a second aspect of the embodiments of the present application, a data conversion method is provided, including:
obtaining a response to a user instruction sent to a smart device, where the response contains a text to be replied for the user instruction;
obtaining a phoneme vector and a text vector corresponding to the text to be replied and a voiceprint feature vector of a target human voice;
obtaining a linguistic feature vector corresponding to the text to be replied according to the phoneme vector and the text vector; and predicting a hidden prosody vector of the text to be replied according to the text vector and the voiceprint feature vector;
generating speech spectrum information corresponding to the text to be replied according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector;
generating the speech corresponding to the text to be replied according to the speech spectrum information and playing the speech.
According to a third aspect of the embodiments of the present application, a data conversion method is provided, including:
obtaining the live script text corresponding to an object to be broadcast live;
obtaining a phoneme vector and a text vector corresponding to the live script text and a voiceprint feature vector of a target human voice;
obtaining a linguistic feature vector corresponding to the live script text according to the phoneme vector and the text vector; and predicting a hidden prosody vector of the live script text according to the text vector and the voiceprint feature vector;
generating speech spectrum information corresponding to the live script text according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector;
generating live speech corresponding to the live script text according to the speech spectrum information.
According to a fourth aspect of the embodiments of the present application, a data conversion method is provided, including:
obtaining a script text to be performed, where the script text to be performed includes one of the following: a line script corresponding to audio or video, and the text content of an e-book;
obtaining a phoneme vector and a text vector corresponding to the script text and a voiceprint feature vector of a target human voice;
obtaining a linguistic feature vector corresponding to the script text according to the phoneme vector and the text vector; and predicting a hidden prosody vector of the script text according to the text vector and the voiceprint feature vector;
generating speech spectrum information corresponding to the script text according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector;
generating performance speech corresponding to the script text according to the speech spectrum information.
According to a fifth aspect of the embodiments of the present application, a data conversion method is provided, including:
obtaining a phoneme vector corresponding to a text to be converted through a phoneme encoding network of a prosody model, and obtaining a text vector corresponding to the text to be converted through a text encoding network of the prosody model;
predicting a hidden prosody vector of the text to be converted through a hidden prosody vector prediction network of the prosody model according to the text vector corresponding to the text to be converted and an acquired voiceprint feature vector of a target human voice;
adding the phoneme vector and the text vector through a vector splicing layer of the prosody model to obtain a linguistic feature vector corresponding to the text to be converted; and splicing the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector to generate a spliced vector;
decoding the spliced vector through a decoding network of the prosody model to obtain speech spectrum information corresponding to the text to be converted.
According to a sixth aspect of the embodiments of the present application, a data conversion device is provided, including:
an acquisition module, configured to acquire a phoneme vector and a text vector corresponding to a text to be converted and a voiceprint feature vector of a target human voice;
the acquisition module is further configured to obtain a linguistic feature vector corresponding to the text to be converted according to the phoneme vector and the text vector;
a prediction module, configured to predict a hidden prosody vector of the text to be converted according to the text vector and the voiceprint feature vector;
a generation module, configured to generate speech spectrum information corresponding to the text to be converted according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector.
According to a seventh aspect of the embodiments of the present application, a data conversion device is provided, including:
an acquisition module, configured to acquire a response to a user instruction sent to a smart device, where the response contains a text to be replied for the user instruction;
the acquisition module is further configured to acquire a phoneme vector and a text vector corresponding to the text to be replied and a voiceprint feature vector of a target human voice;
the acquisition module is further configured to obtain a linguistic feature vector corresponding to the text to be replied according to the phoneme vector and the text vector;
a prediction module, configured to predict a hidden prosody vector of the text to be replied according to the text vector and the voiceprint feature vector;
a generation module, configured to generate speech spectrum information corresponding to the text to be replied according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector;
a processing module, configured to generate the speech corresponding to the text to be replied according to the speech spectrum information and play the speech.
According to an eighth aspect of the embodiments of the present application, a data conversion device is provided, including:
an acquisition module, configured to acquire the live script text corresponding to an object to be broadcast live;
the acquisition module is further configured to acquire a phoneme vector and a text vector corresponding to the live script text and a voiceprint feature vector of a target human voice;
the acquisition module is further configured to obtain a linguistic feature vector corresponding to the live script text according to the phoneme vector and the text vector;
a prediction module, configured to predict a hidden prosody vector of the live script text according to the text vector and the voiceprint feature vector;
a generation module, configured to generate speech spectrum information corresponding to the live script text according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector;
a processing module, configured to generate live speech corresponding to the live script text according to the speech spectrum information.
According to a ninth aspect of the embodiments of the present application, a data conversion device is provided, including:
an acquisition module, configured to acquire a script text to be performed, where the script text to be performed includes one of the following: a line script corresponding to audio or video, and the text content of an e-book;
the acquisition module is further configured to acquire a phoneme vector and a text vector corresponding to the script text and a voiceprint feature vector of a target human voice;
the acquisition module is further configured to obtain a linguistic feature vector corresponding to the script text according to the phoneme vector and the text vector;
a prediction module, configured to predict a hidden prosody vector of the script text according to the text vector and the voiceprint feature vector;
a generation module, configured to generate speech spectrum information corresponding to the script text according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector;
a processing module, configured to generate performance speech corresponding to the script text according to the speech spectrum information.
According to a tenth aspect of the embodiments of the present application, a data conversion device is provided, including:
an acquisition module, configured to acquire a phoneme vector corresponding to a text to be converted through a phoneme encoding network of a prosody model, and acquire a text vector corresponding to the text to be converted through a text encoding network of the prosody model;
a prediction module, configured to predict a hidden prosody vector of the text to be converted through a hidden prosody vector prediction network of the prosody model according to the text vector corresponding to the text to be converted and an acquired voiceprint feature vector of a target human voice;
the acquisition module is further configured to add the phoneme vector and the text vector through a vector splicing layer of the prosody model to obtain a linguistic feature vector corresponding to the text to be converted;
a generation module, configured to splice the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector to generate a spliced vector;
a processing module, configured to decode the spliced vector through a decoding network of the prosody model to obtain speech spectrum information corresponding to the text to be converted.
According to an eleventh aspect of the embodiments of the present application, an electronic device is provided, including:
a memory, configured to store a program;
a processor, configured to execute the program stored in the memory, where when the program is executed, the processor is configured to perform the data conversion method according to any one of the first to fifth aspects.
According to a twelfth aspect of the embodiments of the present application, a computer storage medium is provided, on which a computer program is stored, and when the program is executed by a processor, the data conversion method according to any one of the first to fifth aspects is implemented.
According to a thirteenth aspect of the embodiments of the present application, a computer program product is provided, including a computer program, where when the computer program is executed by a processor, the data conversion method according to any one of the first to fifth aspects is implemented.
According to the data conversion solution provided by the embodiments of the present application, when acquiring the spectrum of the text to be converted into speech, the phonemes and text of the text to be converted and the voiceprint features of the target human voice are considered together. The linguistic features of the text to be converted can be obtained based on the phonemes and the text; these features carry pronunciation features at the level corresponding to the text (such as the character level, word level, or sentence level). The hidden prosody vector of the text to be converted can be predicted based on the text and the voiceprint features; this vector mainly contains prosody information, and the prosody obtained in this way is derived from features corresponding to the text and focuses more on the characteristics of the prosody itself. The speech spectrum information finally obtained by processing the linguistic features, the hidden prosody vector, and the voiceprint features better fits the speech characteristics of the target human voice corresponding to the actual voiceprint features and is closer to the actual prosody of the target human voice. As a result, the speech subsequently generated based on the obtained speech spectrum information is also closer to an actual human voice.
It can be seen that, with the solution of the embodiments of the present application, on the one hand, prosody modeling is no longer based on the fundamental frequency; instead, prosody information is extracted from multiple kinds of information related to prosody, which makes the extracted prosody more accurate. On the other hand, the relationships among the various factors that affect prosody (such as the phonemes, the text, and the voiceprint of the target human voice) are considered together, which also makes the resulting prosody more accurate.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of an exemplary system to which the data conversion method of the embodiments of the present application is applicable;
FIG. 2A is a flowchart of the steps of a data conversion method according to Embodiment 1 of the present application;
FIG. 2B is a schematic diagram of an example model in the embodiment shown in FIG. 2A;
FIG. 2C is a schematic diagram of an example scenario in the embodiment shown in FIG. 2A;
FIG. 3A is a flowchart of the steps of a data conversion method according to Embodiment 2 of the present application;
FIG. 3B is a schematic diagram of an example model and its training process in the embodiment shown in FIG. 3A;
FIG. 4 is a schematic structural diagram of an electronic device according to Embodiment 3 of the present application.
Detailed Description
To enable those skilled in the art to better understand the technical solutions in the embodiments of the present application, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings of the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, rather than all of them. All other embodiments obtained by persons of ordinary skill in the art based on the embodiments of the present application shall fall within the protection scope of the embodiments of the present application.
Specific implementations of the embodiments of the present application are further described below with reference to the accompanying drawings.
FIG. 1 shows an exemplary system to which the data conversion method of the embodiments of the present application is applicable. As shown in FIG. 1, the system 100 may include a server 102, a communication network 104, and/or one or more user devices 106, illustrated in FIG. 1 as multiple user devices.
The server 102 may be any suitable server for storing information, data, programs, and/or any other suitable type of content. In some embodiments, the server 102 may perform any suitable function. For example, in some embodiments, the server 102 may be used to determine the speech spectrum information to be used in the speech synthesis process. As an optional example, in some embodiments, the server 102 may be used to determine, based on the text to be converted, the corresponding speech spectrum information, and then perform speech synthesis based on the speech spectrum information. As another example, in some embodiments, the server 102 may determine the corresponding speech spectrum information based on the phonemes and text corresponding to the text to be converted and the voiceprint of the target human voice.
In some embodiments, the communication network 104 may be any suitable combination of one or more wired and/or wireless networks. For example, the communication network 104 can include any one or more of the following: the Internet, an intranet, a wide area network (WAN), a local area network (LAN), a wireless network, a digital subscriber line (DSL) network, a frame relay network, an asynchronous transfer mode (ATM) network, a virtual private network (VPN), and/or any other suitable communication network. The user device 106 can be connected to the communication network 104 via one or more communication links (e.g., communication link 112), and the communication network 104 can be linked to the server 102 via one or more communication links (e.g., communication link 114). A communication link may be any communication link suitable for transferring data between the user device 106 and the server 102, such as a network link, a dial-up link, a wireless link, a hard-wired link, any other suitable communication link, or any suitable combination of such links.
The user device 106 may include any one or more user devices suitable for presenting an interface for information input and output and for playing speech. In some embodiments, the user device 106 may include any suitable type of device. For example, in some embodiments, the user device 106 may include an IoT device, a mobile device, a tablet computer, a laptop computer, a desktop computer, a wearable computer, a game console, a media player, a vehicle entertainment system, and/or any other suitable type of user device. Note that, in some embodiments, if the user device 106 has sufficiently high software and hardware performance, it may also implement the functions of the server 102 instead.
Although the server 102 is illustrated as one device, in some embodiments any suitable number of devices may be used to perform the functions performed by the server 102. For example, in some embodiments, multiple devices may be used to implement the functions performed by the server 102. Alternatively, the functions of the server 102 may be implemented using a cloud service.
Based on the above system, an embodiment of the present application provides a data conversion method, which is described below through multiple embodiments.
Embodiment 1
Referring to FIG. 2A, a flowchart of the steps of a data conversion method according to Embodiment 1 of the present application is shown.
The data conversion method of this embodiment includes the following steps:
Step S202: Obtain a phoneme vector and a text vector corresponding to the text to be converted and a voiceprint feature vector of the target human voice.
A phoneme is the smallest speech unit divided according to the natural attributes of speech; analyzed in terms of the articulatory actions within a syllable, one action constitutes one phoneme. For example, 阿 (a) has only one phoneme, 个 (ge) has two phonemes, and so on. Generally speaking, in the process of converting text into speech, phonemes are an important consideration and basis for the conversion. In the specific conversion process, it is also necessary to determine what kind of human voice the text is to be converted into; therefore, voiceprint features are needed as a reference so that speech approximating the target human voice is finally generated.
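As an illustration only, the grapheme-to-phoneme step for Chinese text might be sketched as follows. The use of the pypinyin library, and treating tone-numbered pinyin syllables as the phoneme sequence, are assumptions made for this sketch and are not part of the embodiments themselves.

```python
# Minimal grapheme-to-phoneme sketch (assumption: pypinyin; tone-numbered pinyin stands in for phonemes).
from pypinyin import lazy_pinyin, Style

def text_to_phonemes(text: str):
    """Convert each Chinese character into a tone-numbered pinyin syllable."""
    return lazy_pinyin(text, style=Style.TONE3)

print(text_to_phonemes("一加一等于二"))  # e.g. ['yi1', 'jia1', 'yi1', 'deng3', 'yu2', 'er4']
```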
In addition, the text vector of the text to be converted is also used in the embodiments of the present application. In practical applications, the text vector may be at different levels, such as the phoneme level, character level, word level, clause level, or sentence level. The text vector is strongly correlated with the other vectors used to generate prosody, such as the phoneme vector and the voiceprint feature vector, and it can provide richer reference information, including but not limited to textual information and/or semantic information, for the subsequent generation of prosody-related vectors. Preferably, the text vector may be at the character level: on the one hand, its correspondence with the phoneme vector is better; on the other hand, it can be realized with a relatively simple network structure, which reduces the implementation complexity and cost of the solution.
It should be noted that the specific manner of generating the corresponding phoneme vector and text vector based on the text to be converted in this step, as well as the manner of obtaining the voiceprint feature vector of the target human voice, can be implemented by those skilled in the art in an appropriate way (such as with a neural network model or an algorithm) according to the actual situation, which is not limited in the embodiments of the present application.
Step S204: Obtain a linguistic feature vector corresponding to the text to be converted according to the phoneme vector and the text vector; and predict a hidden prosody vector of the text to be converted according to the text vector and the voiceprint feature vector.
In the embodiments of the present application, on the one hand, the text vector is combined with the phoneme vector to generate a linguistic feature vector carrying prosodic information and semantic information; on the other hand, the text vector is combined with the voiceprint feature vector to predict a hidden prosody vector that mainly carries prosody information related to the text.
Although text vectors are used in both aspects, it can be seen from the above that the goals to be achieved with the text vectors are different. Therefore, in one feasible manner, the text vectors used in the two aspects can be obtained in different ways. For example, the text vector to be combined with the phoneme vector can be obtained through a character encoding network (also called a character encoder), while the text vector to be combined with the voiceprint feature vector can be obtained through a context encoding network (also called a context encoder, such as a BERT model). In this way, the needs of the different parts can be better met, and the overall solution is more flexible. The full name of the BERT model is Bidirectional Encoder Representations from Transformers.
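For illustration, character-level text vectors from a pretrained context encoder might be obtained as in the following sketch. The Hugging Face transformers library and the bert-base-chinese checkpoint are assumptions of the example, since the embodiments only require some context encoding network such as a BERT model.

```python
# Sketch: character-level text vectors from a pretrained BERT context encoder.
# Assumptions: Hugging Face transformers and the "bert-base-chinese" checkpoint.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")

text = "一加一等于二"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

# last_hidden_state: [1, seq_len, 768]; drop [CLS]/[SEP] to keep one vector per character.
char_text_vectors = outputs.last_hidden_state[:, 1:-1, :]
print(char_text_vectors.shape)  # torch.Size([1, 6, 768])
```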
Step S206: Generate speech spectrum information corresponding to the text to be converted according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector.
After the linguistic feature vector and the hidden prosody vector are obtained, they are combined with the previously obtained voiceprint feature vector for feature fusion, and corresponding processing such as decoding is performed based on the fused features to obtain the speech spectrum information, which contains the prosody information of the text to be converted. In the embodiments of the present application, the prosody information includes but is not limited to intonation, speech rate, energy, and spatial information.
In one feasible manner, the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector may be spliced to generate a spliced vector, and the spliced vector may be decoded to obtain the speech spectrum information corresponding to the text to be converted. Because the spliced vector carries rich information associated with the prosody of the text to be converted, the speech spectrum information obtained by decoding the spliced vector is also more accurate.
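A minimal sketch of this splicing step is given below; the tensor shapes and the use of PyTorch are assumptions for illustration.

```python
# Sketch: splicing the linguistic, hidden prosody and voiceprint vectors before decoding.
# Assumed shapes: [batch, length, dim] for sequence features, [batch, dim] for the voiceprint vector.
import torch

B, T, D = 2, 50, 256
h_ling = torch.randn(B, T, D)   # linguistic feature vector
lpv = torch.randn(B, T, D)      # hidden prosody vector (assumed aligned to the same length)
h_spk = torch.randn(B, D)       # voiceprint feature vector of the target human voice

spliced = torch.cat([h_ling, lpv, h_spk.unsqueeze(1).expand(-1, T, -1)], dim=-1)
print(spliced.shape)  # torch.Size([2, 50, 768]); this spliced vector is then fed to the decoding network
```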
In one feasible manner, the above process can be implemented by a neural network model, referred to in this application as a prosody model. An exemplary prosody model is shown in FIG. 2B. As can be seen from FIG. 2B, the prosody model includes: a phoneme encoding network (shown in the figure as the Phoneme Encoder), a text encoding network (shown in the figure as the character-level Word Encoder), a hidden prosody vector prediction network (shown in the figure as the LPV Predictor), a vector splicing layer (shown in the figure as the dashed box containing the "+" sign), and a decoding network (shown in the figure as the dashed box containing the Decoder).
The phoneme encoding network is used to obtain the phoneme vector corresponding to the text to be converted; the text encoding network is used to obtain the text vector corresponding to the text to be converted; the hidden prosody vector prediction network is used to predict the hidden prosody vector of the text to be converted according to the text vector corresponding to the text to be converted and the acquired voiceprint feature vector of the target human voice; the vector splicing layer is used to add the phoneme vector and the text vector to obtain the linguistic feature vector corresponding to the text to be converted, and to splice the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector to generate a spliced vector; the decoding network is used to decode the spliced vector to obtain the speech spectrum information corresponding to the text to be converted.
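A schematic PyTorch skeleton of such a prosody model is sketched below. The layer types, dimensions, and the assumption that the phoneme and character sequences are already aligned to the same length are illustrative simplifications; they do not reflect the exact network structures of the embodiments.

```python
# Schematic skeleton of the prosody model (illustrative simplifications throughout).
import torch
import torch.nn as nn

class ProsodyModel(nn.Module):
    def __init__(self, n_phonemes=100, n_chars=5000, d=256, n_mels=80):
        super().__init__()
        self.phoneme_encoder = nn.Embedding(n_phonemes, d)        # Phoneme Encoder
        self.word_encoder = nn.Embedding(n_chars, d)              # character-level Word Encoder
        self.lpv_predictor = nn.GRU(2 * d, d, batch_first=True)   # LPV Predictor (text vector + voiceprint)
        self.decoder = nn.GRU(3 * d, d, batch_first=True)         # decoding network
        self.linear = nn.Linear(d, n_mels)                        # Linear Layer -> Mel spectrogram

    def forward(self, phonemes, chars, h_spk):
        # phonemes, chars: [B, T] index sequences (assumed aligned); h_spk: [B, d] voiceprint vector
        h_ling = self.phoneme_encoder(phonemes) + self.word_encoder(chars)   # addition -> linguistic features
        spk = h_spk.unsqueeze(1).expand(-1, phonemes.size(1), -1)
        lpv, _ = self.lpv_predictor(torch.cat([self.word_encoder(chars), spk], dim=-1))
        spliced = torch.cat([h_ling, lpv, spk], dim=-1)                       # vector splicing layer
        out, _ = self.decoder(spliced)
        return self.linear(out)                                               # predicted speech spectrum
```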
When the prosody model shown in FIG. 2B is used, the solution of the embodiments of the present application can be implemented as follows: obtaining the phoneme vector corresponding to the text to be converted through the phoneme encoding network of the prosody model, and obtaining the text vector corresponding to the text to be converted through the text encoding network of the prosody model; predicting the hidden prosody vector of the text to be converted through the hidden prosody vector prediction network of the prosody model according to the text vector corresponding to the text to be converted and the acquired voiceprint feature vector of the target human voice; adding the phoneme vector and the text vector through the vector splicing layer of the prosody model to obtain the linguistic feature vector corresponding to the text to be converted, and splicing the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector to generate a spliced vector; and decoding the spliced vector through the decoding network of the prosody model to obtain the speech spectrum information corresponding to the text to be converted.
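Using the ProsodyModel skeleton sketched above, an inference call might look like the following; the dummy shapes are purely illustrative.

```python
# Dummy inference call on the ProsodyModel sketched above (shapes are illustrative).
import torch

model = ProsodyModel()                       # class from the skeleton above
phonemes = torch.randint(0, 100, (1, 20))    # phoneme index sequence of the text to be converted
chars = torch.randint(0, 5000, (1, 20))      # character index sequence (assumed aligned to the phonemes)
h_spk = torch.randn(1, 256)                  # voiceprint feature vector of the target human voice
mel = model(phonemes, chars, h_spk)
print(mel.shape)                             # torch.Size([1, 20, 80])
```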
In addition, as shown in the figure, the decoding network part of the prosody model in this example is further provided with a length regulator (Length Regulator) and a linearization layer (Linear Layer). The Length Regulator is used to adjust the lengths of the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector so that their lengths are consistent with the speech spectrum information. The Linear Layer is used to linearize the output of the Decoder.
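The length-regulation step can be illustrated with the following sketch, which expands phoneme-level features to frame level according to per-phoneme durations; the interface and the duration source are assumptions of the example.

```python
# Sketch of a length regulator: repeat each phoneme-level vector by its frame duration.
import torch

def length_regulate(h, durations):
    """h: [T, D] phoneme-level features; durations: [T] number of frames per phoneme."""
    return torch.repeat_interleave(h, durations, dim=0)   # -> [sum(durations), D]

h = torch.randn(3, 4)
durations = torch.tensor([2, 1, 3])
print(length_regulate(h, durations).shape)  # torch.Size([6, 4])
```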
As can be seen from FIG. 2B, although both the Word Encoder and the LPV Predictor process the "Word" input, in order to make the "Word" better meet the needs of each part and to make the prosody model more flexible, in one optional manner the text encoding network includes a character encoding network and a context encoding network. The character encoding network, shown in the figure as the Word Encoder, is used to encode the text to be converted at the character level and generate the character text vector to be added to the phoneme vector. The context encoding network can be a network such as a BERT network or another network that can generate text vectors, and is used to encode the text to be converted at the character level and generate the character text vector to be input, together with the voiceprint feature vector, into the hidden prosody vector prediction network. As mentioned above, however, the two encoding networks can also adopt the same structure, which is equally applicable to the solution of the embodiments of the present application.
Based on the above prosody model, the data conversion method of this embodiment is exemplarily described below from the perspective of the speech synthesis process, as shown in FIG. 2C.
The speech synthesis process usually includes three parts: front-end processing, acoustic model processing, and vocoder processing. The front-end processing mainly obtains pronunciation and linguistic information from the text to be converted, including but not limited to text normalization and grapheme-to-phoneme conversion (such as converting text characters into pronunciation information like phonemes, so that the subsequent acoustic model can accurately obtain the pronunciation of the text characters), and so on.
The acoustic model processing part is mainly completed by an acoustic model, implemented in this example as the above prosody model, which generates acoustic features, such as a Mel spectrogram, based on the pronunciation information or linguistic information produced by the front-end processing. Specifically in this example, the prosody model outputs a Mel spectrogram based on the phonemes of the text to be converted, the character-level text, and the voiceprint features of the target human voice to be converted into. This process is as described above and is not repeated here.
The Mel spectrogram output by the prosody model is then input into a vocoder, which synthesizes the waveform of the final sound based on the Mel spectrogram, thereby completing the text-to-speech (TTS) conversion process.
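The embodiments do not prescribe a particular vocoder. As one illustrative substitute for a trained neural vocoder, a Mel spectrogram can be inverted to a waveform with the Griffin-Lim based inversion in librosa, as sketched below; the sampling rate and STFT parameters are assumptions.

```python
# Illustrative vocoder substitute: invert a Mel spectrogram to audio with Griffin-Lim (librosa).
# A trained neural vocoder would normally be used instead; this is only a stand-in for the sketch.
import numpy as np
import librosa
import soundfile as sf

sr = 22050
mel = np.abs(np.random.randn(80, 200))  # placeholder Mel spectrogram [n_mels, frames]
audio = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024, hop_length=256)
sf.write("synth.wav", audio, sr)
```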
In an example of a human-computer interaction scenario, the speech synthesis process includes: obtaining a response to a user instruction sent to a smart device, where the response contains a text to be replied for the user instruction; obtaining a phoneme vector and a text vector corresponding to the text to be replied and a voiceprint feature vector of a target human voice; obtaining a linguistic feature vector corresponding to the text to be replied according to the phoneme vector and the text vector; predicting a hidden prosody vector of the text to be replied according to the text vector and the voiceprint feature vector; generating speech spectrum information corresponding to the text to be replied according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector; and generating and playing the speech corresponding to the text to be replied according to the speech spectrum information.
In this example, a human-computer interaction scenario is assumed, the smart device is exemplified as a smart speaker, the user instruction is exemplified as a voice question uttered by a user, and the text to be replied is correspondingly the reply to the voice question. User X asks the smart speaker a voice question, "What is 1 plus 1?". After receiving the voice question, the smart speaker converts it into text and sends it to a server for a query, and obtains the query result "1 plus 1 equals 2" returned by the server. After receiving the query result, the smart speaker converts each character in the query result into phonemes to form a phoneme sequence. The smart speaker also has its own voiceprint features. Therefore, the smart speaker feeds the corresponding phonemes in the phoneme sequence, the characters, and the voiceprint features into the prosody model in character order, and the prosody model outputs a Mel spectrogram through the processing described above; the Mel spectrogram is then input into the vocoder, and the vocoder synthesizes the final speech for playback. In this way, the reply to the voice question of user X is realized.
In FIG. 2C, for ease of description, the prosody model and the vocoder are illustrated separately, but those skilled in the art should understand that, in practical applications, both the prosody model and the vocoder are provided in the smart speaker and are controlled and executed by corresponding components in the smart speaker, such as a processor.
In another example of a live-broadcast scenario, the speech synthesis process may include: obtaining the live script text corresponding to an object to be broadcast live; obtaining a phoneme vector and a text vector corresponding to the live script text and a voiceprint feature vector of a target human voice; obtaining a linguistic feature vector corresponding to the live script text according to the phoneme vector and the text vector; predicting a hidden prosody vector of the live script text according to the text vector and the voiceprint feature vector; generating speech spectrum information corresponding to the live script text according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector; and generating live speech corresponding to the live script text according to the speech spectrum information.
The live script corresponding to the live-broadcast object may be a live script corresponding to multiple live-broadcast objects (such as commodities, content, or programs), for example the script of an entire live broadcast, or a live script corresponding to one or some of multiple live-broadcast objects. After the live script is obtained, the method described above can be used to finally convert the live script into live speech for application in live-broadcast scenarios, such as live-streaming sales or live content promotion. The live speech can be adapted to a virtual host or to a real human host, and is widely applicable in live-broadcast scenarios.
In yet another performance scenario, the speech synthesis process may include: obtaining the script text to be performed; obtaining a phoneme vector and a text vector corresponding to the script text and a voiceprint feature vector of a target human voice; obtaining a linguistic feature vector corresponding to the script text according to the phoneme vector and the text vector; predicting a hidden prosody vector of the script text according to the text vector and the voiceprint feature vector; generating speech spectrum information corresponding to the script text according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector; and generating performance speech corresponding to the script text according to the speech spectrum information.
The script text to be performed includes one of the following: a line script corresponding to audio or video, or the text content of an e-book. After the script text is obtained, the method described above can be used to finally convert the script text into performance speech for application in performance scenarios. For example, the performance speech can be used to dub video characters, to generate audio, or to realize audio e-books, and so on.
It can be seen that, with this embodiment, when acquiring the spectrum of the text to be converted into speech, the phonemes and text of the text to be converted and the voiceprint features of the target human voice are considered together. The linguistic features of the text to be converted can be obtained based on the phonemes and the text; these features carry pronunciation features at the level corresponding to the text (such as the character level, word level, or sentence level). The hidden prosody vector of the text to be converted can be predicted based on the text and the voiceprint features; this vector mainly contains prosody information, and the prosody obtained in this way is derived from features corresponding to the text and focuses more on the characteristics of the prosody itself. The speech spectrum information finally obtained by processing the linguistic features, the hidden prosody vector, and the voiceprint features better fits the speech characteristics of the target human voice corresponding to the actual voiceprint features and is closer to the actual prosody of the target human voice. As a result, the speech subsequently generated based on the obtained speech spectrum information is also closer to an actual human voice.
It can be seen that, with the solution of this embodiment, on the one hand, prosody modeling is no longer based on the fundamental frequency; instead, prosody information is extracted from multiple kinds of information related to prosody, which makes the extracted prosody more accurate. On the other hand, the relationships among the various factors that affect prosody (such as the phonemes, the text, and the voiceprint of the target human voice) are considered together, which also makes the resulting prosody more accurate.
Embodiment 2
Referring to FIG. 3A, a flowchart of the steps of a data conversion method according to Embodiment 2 of the present application is shown.
This embodiment takes data conversion using a prosody model as an example; the training process of the prosody model is introduced first, and data conversion is then performed based on the trained prosody model.
The data conversion method of this embodiment includes the following steps:
Step S302: Obtain training samples, and use the training samples to train the prosody model.
The training samples include text samples to be converted, corresponding speech samples, and voiceprint feature sample vectors. In the embodiments of the present application, low-frequency-band speech samples are used as the speech samples, for example speech samples in the 0-2 kHz band. On the one hand, low-frequency-band speech samples carry sufficient prosody-related information, so the training effect is not affected; on the other hand, removing the speech outside the low-frequency band keeps the model structure relatively simple. It should be noted, however, that full-band speech samples are equally applicable to the solution of the embodiments of the present application. In addition, low-quality speech samples containing noise can also be used instead of being limited to high-quality speech samples; in this way, audio in videos, ordinary audio, broadcast audio, and the like can all serve as speech samples in the embodiments of the present application, which greatly enriches the number and selection range of speech samples and reduces the cost of obtaining them.
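A sketch of extracting low-frequency-band acoustic features for such training samples is given below; using librosa and limiting the Mel filterbank to 0-2 kHz via fmax is only one possible way to realize this and is an assumption of the example.

```python
# Sketch: low-band (0-2 kHz) Mel spectrogram of a training speech sample.
# Assumption: restricting the Mel filterbank with fmax=2000 stands in for using low-frequency-band speech.
import librosa

wav, sr = librosa.load("sample.wav", sr=22050)   # "sample.wav" is a placeholder path
mel_low = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024, hop_length=256,
                                         n_mels=80, fmin=0, fmax=2000)
print(mel_low.shape)  # (80, n_frames)
```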
In this embodiment, the prosody model is shown in FIG. 3B and includes: a phoneme encoding network (Phoneme Encoder in the figure), a text encoding network, a prosody encoding network (Prosody Encoder in the figure), a hidden prosody vector prediction network (LPV Predictor in the figure), a vector concatenation layer (the dashed box containing the "+" sign in the figure), and a decoding network (the dashed box containing the Decoder in the figure).
Based on this structure, training the prosody model includes: inputting the phonemes corresponding to the text samples to be converted into the phoneme encoding network to obtain corresponding phoneme sample vectors; inputting the characters of the text samples to be converted into the text encoding network to obtain corresponding character sample text vectors; inputting the speech samples, the phoneme sample vectors, the character sample text vectors, and the voiceprint feature sample vectors into the prosody encoding network to obtain corresponding first hidden prosody sample vectors (the LPV shown in the prosody model of FIG. 3B); and training the prosody model based on the phoneme sample vectors, the character sample text vectors, the voiceprint feature sample vectors, and the first hidden prosody sample vectors.
To make the model more flexible, the text encoding network is divided into a character encoding network (the character-level Word Encoder in the figure) and a context encoding network (the Context Encoder in the upper right corner of the figure). On this basis, inputting the characters of the text to be converted into the text encoding network to obtain the corresponding character sample text vectors may be implemented as: inputting the characters of the text samples to be converted into the character encoding network and the context encoding network respectively, to obtain corresponding first character sample text vectors and second character sample text vectors. Correspondingly, inputting the speech samples, the phoneme sample vectors, the character sample text vectors, and the voiceprint feature sample vectors into the prosody encoding network to obtain the corresponding first hidden prosody sample vectors may be implemented as: inputting the speech samples, the phoneme sample vectors, the first character sample text vectors, and the voiceprint feature sample vectors into the prosody encoding network to obtain the corresponding first hidden prosody sample vectors.
In addition, in this embodiment, the decoding network is provided with a Length Regulator layer and a Linear Layer besides the Decoder. The Length Regulator adjusts the lengths of the linguistic feature sample vectors, the first hidden prosody sample vectors, and the voiceprint feature sample vectors so that their lengths are consistent with the speech spectrum information. The Linear Layer linearizes the output of the Decoder.
With this structure, training of the portion inside the L-shaped dashed box on the left of FIG. 3B includes: converting the text sequence of the input text sample to be converted into a phoneme sequence (Phoneme in the figure) and a character sequence (Word in the figure), which are fed into the phoneme encoding network Phoneme Encoder and the character encoding network Word Encoder respectively. The Phoneme Encoder produces the phoneme sample vector Phoneme Embedding, and the Word Encoder produces the first character sample text vector Word Embedding. The Phoneme Embedding and the Word Embedding are then summed to obtain the linguistic feature sample vector H_ling. Next, based on H_ling and H_spk (the voiceprint feature sample vector), the mel spectrogram (mel-spec) of the sample voice, i.e. the low-frequency part of the speech sample (such as the 0-2 kHz part), is passed through the prosody encoding network Prosody Encoder to obtain the first hidden prosody sample vectors (latent prosody vectors, LPV). Finally, H_ling, H_spk, and the first hidden prosody sample vectors are concatenated and fed into the subsequent decoding network to obtain the predicted mel spectrogram.
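Purely as a hedged sketch, and not as the embodiment's actual implementation, the training forward pass described above may be pictured with the following PyTorch-style structure; the sub-module names (phoneme_encoder, word_encoder, prosody_encoder, length_regulator, decoder, linear) are assumed placeholders, and the alignment of phoneme-level, character-level, frame-level and speaker-level sequences is glossed over.

```python
import torch

def training_forward(model, phonemes, words, mel_low, h_spk):
    """One illustrative training forward pass mirroring the described flow.

    `model` is assumed to expose phoneme_encoder, word_encoder,
    prosody_encoder, length_regulator, decoder and linear sub-modules;
    all sequences are assumed to be pre-aligned, and h_spk is assumed
    to be broadcast to the sequence length where needed.
    """
    ph_emb = model.phoneme_encoder(phonemes)    # Phoneme Embedding
    word_emb = model.word_encoder(words)        # first character sample text vector
    h_ling = ph_emb + word_emb                  # linguistic feature vector H_ling (sum)
    # First hidden prosody sample vectors from the low-band mel spectrogram.
    lpv = model.prosody_encoder(mel_low, ph_emb, word_emb, h_spk)
    x = torch.cat([h_ling, lpv, h_spk], dim=-1) # concatenation layer
    x = model.length_regulator(x)               # match spectrogram length (durations omitted)
    return model.linear(model.decoder(x))       # predicted mel spectrogram
```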
In this embodiment, the training process of the prosody encoding network Prosody Encoder may be implemented, by way of example, as follows: performing feature extraction on the speech sample through a first convolutional layer of the prosody encoding network based on the phoneme sample vector and the voiceprint feature sample vector, to obtain first prosody sample features; performing character-level pooling on the first prosody sample features through a pooling layer of the prosody encoding network, to obtain character-level prosody sample features; performing feature extraction on the character-level prosody sample features through a second convolutional layer of the prosody encoding network based on the first character sample text vector and the voiceprint feature sample vector, to obtain second prosody sample features; and vectorizing the second prosody sample features through a vectorization layer of the prosody encoding network, to obtain the first hidden prosody sample vector. In this way, the structure of the prosody encoding network is simplified while the hidden prosody sample vectors can still be extracted effectively.
By way of example, as shown in part (b) of FIG. 3B, the inputs of the prosody encoding network Prosody Encoder are the low-frequency part of the mel spectrogram of the speech sample corresponding to the text sample to be converted, the Phoneme Embedding and Word Embedding (for brevity, simply denoted H_ling in the text), and H_spk; the output is the character-level sequence of first hidden prosody sample vectors. The Prosody Encoder contains two levels of Conv Stacks (convolution stacks). The first-level Conv Stacks process the low-frequency part of the mel spectrogram and additionally receive the Phoneme Embedding and H_spk; by adding the Phoneme Embedding, the convolution over the low-frequency part of the mel spectrogram can filter out the influence of phonemes on prosody. The convolved low-frequency mel spectrogram is then compressed to the character level by the pooling operation of the character-level pooling layer Word-level Pooling. The second-level Conv Stacks obtain a hidden prosody representation based on the output of the first-level Conv Stacks, the Word Embedding, and H_spk; by adding the Word Embedding, the convolution can filter out the influence of character semantics on prosody. Finally, based on this hidden prosody representation, the character-level sequence of first hidden prosody sample vectors is obtained through the vector quantization layer (Vector Quantization).
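A minimal sketch of the two-stage Prosody Encoder described above is given below, assuming PyTorch; the ConvStack-like sub-modules, the word-level pooling and the vector quantizer are passed in as placeholders, and the conditioning inputs are assumed to have already been aligned to the appropriate sequence lengths.

```python
import torch
import torch.nn as nn

class ProsodyEncoderSketch(nn.Module):
    """Hedged sketch of the two-stage prosody encoder described above.

    conv_stack1, conv_stack2, word_pool and quantizer are placeholder
    modules standing in for the Conv Stacks, the word-level pooling
    layer and the vector quantization layer of FIG. 3B(b).
    """
    def __init__(self, conv_stack1, conv_stack2, word_pool, quantizer):
        super().__init__()
        self.conv_stack1 = conv_stack1
        self.conv_stack2 = conv_stack2
        self.word_pool = word_pool
        self.quantizer = quantizer

    def forward(self, mel_low, ph_emb, word_emb, h_spk, word_boundaries):
        # Stage 1: frame-level convolution, conditioned on phoneme
        # embeddings and the speaker vector to factor phonemes out.
        x = self.conv_stack1(torch.cat([mel_low, ph_emb, h_spk], dim=-1))
        # Pool frame-level features to the character (word) level.
        x = self.word_pool(x, word_boundaries)
        # Stage 2: word-level convolution, conditioned on word embeddings
        # and the speaker vector to factor character semantics out.
        x = self.conv_stack2(torch.cat([x, word_emb, h_spk], dim=-1))
        # Vector quantization yields the character-level hidden prosody vectors.
        return self.quantizer(x)
```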
After the first hidden prosody sample vectors are obtained, the prosody model can be trained based on the phoneme sample vectors, the first character sample text vectors, the voiceprint feature sample vectors, and the first hidden prosody sample vectors. Specifically, this may include: summing the phoneme sample vector and the first character sample text vector through the vector concatenation layer to obtain the linguistic feature vector; concatenating the linguistic feature vector, the voiceprint feature sample vector, and the first hidden prosody sample vector to obtain a concatenated sample vector; and decoding the concatenated sample vector through the decoding network and training the prosody model according to the decoding result.
In an optional solution, before the concatenated sample vector is decoded by the decoding network, length regularization may also be performed on the concatenated sample vector by the length regulator layer; the length-regularized concatenated sample vector is then decoded by the decoding network, as shown in part (a) of FIG. 3B.
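For illustration only, the length regularization can be sketched as repeating each token-level vector by an integer duration so that the sequence length matches the target spectrogram; the source of the durations is an assumption here, since the figure does not detail a duration model.

```python
import torch

def length_regulate(x, durations):
    """Hedged sketch of a length regulator.

    x: (tokens, dim) hidden vectors; durations: (tokens,) integer tensor
    giving how many spectrogram frames each token should cover.
    """
    return torch.repeat_interleave(x, durations, dim=0)
```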
In addition, the prosody encoding network Prosody Encoder not only participates in the training of the portion in the L-shaped dashed box on the left of FIG. 3B(a), but also supports the training of the hidden prosody vector prediction network LPV Predictor. In the inference stage of the prosody model, prosody prediction is carried out mainly by the LPV Predictor, and the Prosody Encoder no longer functions. Therefore, training the prosody model further includes: inputting the second character sample text vector and the voiceprint feature sample vector into the hidden prosody vector prediction network to predict a second hidden prosody sample vector; and training the hidden prosody vector prediction network according to the difference between the first hidden prosody sample vector and the second hidden prosody sample vector.
As described above, the second character sample text vector may be obtained by the context encoding network Context Encoder shown in the upper right corner of FIG. 3B, whose specific structure may adopt a BERT model structure. Those skilled in the art should understand, however, that other structures, such as any text-only pretrained model structure, are equally applicable to the solutions of the embodiments of the present application.
A simple illustration of training the hidden prosody vector prediction network is shown in the lower right corner of FIG. 3B. As can be seen, the prosody encoding network Prosody Encoder outputs the first hidden prosody sample vector based on the low-frequency part of the mel spectrogram of the speech sample (i.e., the noisy audio shown in the lower right corner of FIG. 3B), the Phoneme Embedding and Word Embedding (simply denoted H_ling for brevity), and H_spk. The LPV Predictor outputs the second hidden prosody sample vector based on the character sequence of the text to be converted (Word in the figure) and H_spk. In part (d) of FIG. 3B, both hidden prosody sample vectors are denoted LPV. Based on these two LPVs and a preset loss function, the LPV Predictor can be trained. The loss function may be any appropriate function, including but not limited to a distance function such as a cosine distance function, which is not limited in the embodiments of the present application.
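As a hedged example of the training objective just described, a cosine-distance loss between the two LPVs might look as follows; cosine distance is only one of the admissible loss functions mentioned above, and stopping gradients into the encoder target is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F

def lpv_predictor_loss(lpv_target, lpv_pred):
    """Cosine-distance loss between encoder LPVs (targets) and predicted LPVs."""
    cos = F.cosine_similarity(lpv_pred, lpv_target.detach(), dim=-1)
    return (1.0 - cos).mean()
```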
The LPV Predictor is an autoregressive prediction model. As can be seen from part (c) of FIG. 3B, on the one hand, the LPV Predictor converts the input Word into character vectors through the Context Encoder; to distinguish them from the Word Embedding output by the aforementioned Word Encoder, the character vectors output by the Context Encoder are denoted Hi. On the other hand, when processing the current character, the LPV Predictor also takes the LPV corresponding to the previous character (LPV i-1 in the figure) as a reference. After self-attention is computed over LPV i-1, the result is concatenated with Hi and H_spk, and the concatenated vector is then further processed (e.g., normalization, convolution). The normalization layer may, for example, be expressed as add&norm, where add denotes residual processing and norm denotes normalization, and the convolutional layer may, for example, be a Conv1D (one-dimensional convolution). The prosody prediction result for the current character, i.e., LPV i, is finally obtained. That is, the prediction process may be implemented as: inputting the second character sample text vector and the voiceprint feature sample vector corresponding to the current character to be predicted into the hidden prosody vector prediction network; performing feature fusion on the second character sample text vector corresponding to the current character, the voiceprint feature sample vector, and the second hidden prosody sample vector corresponding to the character preceding the current character; and predicting the second hidden prosody sample vector of the current character based on the fused feature vector. More accurate prosody information can be obtained through this autoregressive approach.
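Under the assumption of standard PyTorch building blocks, and with all layer sizes chosen arbitrarily for illustration, one autoregressive step can be sketched as follows; the actual LPV Predictor of FIG. 3B(c) is not limited to this arrangement.

```python
import torch
import torch.nn as nn

class LPVPredictorStepSketch(nn.Module):
    """Hedged sketch of one autoregressive prediction step.

    self_attn, proj and conv stand in for the self-attention,
    add&norm and Conv1D blocks in the figure; dimensions are
    illustrative only.
    """
    def __init__(self, dim):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(3 * dim, dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)

    def forward(self, lpv_prev, h_i, h_spk):
        # Self-attention over the previously predicted LPVs
        # (in practice the whole prefix would attend causally).
        attn_out, _ = self.self_attn(lpv_prev, lpv_prev, lpv_prev)
        attn_out = self.norm(lpv_prev + attn_out)        # add & norm (residual)
        fused = self.proj(torch.cat([attn_out, h_i, h_spk], dim=-1))
        # Conv1d expects (batch, channels, time).
        lpv_i = self.conv(fused.transpose(1, 2)).transpose(1, 2)
        return lpv_i
```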
Through the above process, each part of the prosody model in this embodiment can be trained. After training is completed, data conversion from text to spectrum can be performed.
Step S304: obtain the phoneme vector and text vector corresponding to the text to be converted and the voiceprint feature vector of the target voice.
For example, with the trained prosody model of FIG. 3B, the phoneme sequence of the text to be converted is encoded into the phoneme vector Phoneme Embedding through its phoneme encoding network Phoneme Encoder, and the character sequence of the text to be converted is converted into the character text vector Word Embedding through its character encoding network Word Encoder. The voiceprint feature vector H_spk of the target voice can be obtained in advance; the specific means of extracting the voiceprint feature vector from the target voice is not limited in the embodiments of the present application.
Step S306: obtain the linguistic feature vector corresponding to the text to be converted according to the phoneme vector and the text vector; predict the hidden prosody vector of the text to be converted according to the text vector and the voiceprint feature vector.
For example, with the trained prosody model of FIG. 3B, the Phoneme Embedding and Word Embedding are summed through the vector concatenation layer to obtain the linguistic feature vector H_ling, and the hidden prosody vector LPV of the text to be converted is obtained through the LPV Predictor.
Step S308: generate the speech spectrum information corresponding to the text to be converted according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector.
For example, with the trained prosody model of FIG. 3B, H_ling, LPV, and H_spk are concatenated through the vector concatenation layer. The result then passes through the Length Regulator, the Decoder, and the Linear Layer of the decoding network in turn for decoding-related processing, and the speech spectrum information corresponding to the text to be converted is finally obtained.
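Putting steps S304-S308 together, an inference-time sketch, assuming the same placeholder sub-module names as in the training sketch above plus context_encoder and lpv_predictor, could look like this; note that at inference the LPV comes from the predictor rather than from the Prosody Encoder, since no reference speech exists for the text to be converted.

```python
import torch

def text_to_spectrogram(model, phonemes, words, h_spk):
    """Hedged inference sketch for steps S304-S308."""
    ph_emb = model.phoneme_encoder(phonemes)
    word_emb = model.word_encoder(words)
    h_ling = ph_emb + word_emb                  # linguistic feature vector
    h_ctx = model.context_encoder(words)        # character vectors for the predictor
    lpv = model.lpv_predictor(h_ctx, h_spk)     # predicted hidden prosody vectors
    x = torch.cat([h_ling, lpv, h_spk], dim=-1)
    x = model.length_regulator(x)
    return model.linear(model.decoder(x))       # speech spectrum information
```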
Further, on the basis of the obtained speech spectrum information, the corresponding speech can be output through a vocoder, realizing the conversion from text to speech.
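The embodiment does not prescribe a particular vocoder. Purely as a stand-in illustration, a Griffin-Lim reconstruction via librosa is shown below; in practice a neural vocoder would typically be used instead, and the assumption here is that the spectrum is a log-power mel spectrogram with the stated frame parameters.

```python
import librosa
import numpy as np

def mel_to_audio(mel_log, sr=16000, n_fft=1024, hop_length=256):
    """Illustrative vocoder step: invert a log-power mel spectrogram."""
    mel = np.exp(mel_log)                        # undo the log compression
    return librosa.feature.inverse.mel_to_audio(
        mel, sr=sr, n_fft=n_fft, hop_length=hop_length)
```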
It should be noted that the descriptions of the above steps S304-S308 are relatively brief; for relevant parts, reference may be made to the related descriptions in the foregoing Embodiment 1 and step S302.
In this embodiment, hidden prosody vectors, rather than prosodic components, are used to characterize prosody. This avoids the problems of the traditional approach, in which inaccurate fundamental-frequency extraction and the lack of correlation among the predictions for individual prosodic components lead to poor prosody modeling, poor spectral results, and in turn poor speech synthesis. With the solution of this embodiment, on the one hand, prosody modeling is no longer based on the fundamental frequency; instead, prosody information is extracted from multiple kinds of prosody-related information, so that the extracted prosody is more accurate. On the other hand, the relationships among the multiple factors affecting prosody (such as phonemes, text, and the voiceprint of the target voice) are considered comprehensively, which also makes the obtained prosody more accurate.
Embodiment 3
Referring to FIG. 4, a schematic structural diagram of an electronic device according to Embodiment 3 of the present application is shown. The specific embodiments of the present application do not limit the specific implementation of the electronic device.
As shown in FIG. 4, the electronic device may include: a processor 402, a communications interface 404, a memory 406, and a communication bus 408.
Wherein:
the processor 402, the communication interface 404, and the memory 406 communicate with one another through the communication bus 408;
the communication interface 404 is used for communicating with other electronic devices or servers;
the processor 402 is configured to execute a program 410, and may specifically perform the relevant steps in the foregoing data conversion method embodiments.
Specifically, the program 410 may include program code, and the program code includes computer operation instructions.
The processor 402 may be a CPU, an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application. The one or more processors included in the smart device may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs.
The memory 406 is used to store the program 410. The memory 406 may include a high-speed RAM memory, and may also include a non-volatile memory, such as at least one disk memory.
The program 410 may specifically be used to cause the processor 402 to perform any of the operations described in the foregoing data conversion method embodiments.
For the specific implementation of the steps in the program 410, reference may be made to the corresponding descriptions of the corresponding steps and units in the related method embodiments among the foregoing data conversion method embodiments, which have corresponding beneficial effects and are not repeated here. Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the devices and modules described above, reference may be made to the corresponding process descriptions in the foregoing method embodiments, which are not repeated here.
In addition, an embodiment of the present application further provides a data conversion apparatus, including:
an acquisition module, configured to acquire a phoneme vector and a text vector corresponding to text to be converted and a voiceprint feature vector of a target voice;
the acquisition module is further configured to obtain a linguistic feature vector corresponding to the text to be converted according to the phoneme vector and the text vector;
a prediction module, configured to predict a hidden prosody vector of the text to be converted according to the text vector and the voiceprint feature vector; and
a generation module, configured to generate speech spectrum information corresponding to the text to be converted according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector.
In a possible design, the text vector is a character text vector corresponding to each character in the text to be converted.
In a possible design, the data conversion method is performed by a prosody model, and the prosody model includes at least: a phoneme encoding network, a text encoding network, a hidden prosody vector prediction network, a vector concatenation layer, and a decoding network;
the phoneme encoding network is configured to acquire the phoneme vector corresponding to the text to be converted;
the text encoding network is configured to acquire the text vector corresponding to the text to be converted;
the hidden prosody vector prediction network is configured to predict the hidden prosody vector of the text to be converted according to the text vector corresponding to the text to be converted and the acquired voiceprint feature vector of the target voice;
the vector concatenation layer is configured to sum the phoneme vector and the text vector to obtain the linguistic feature vector corresponding to the text to be converted, and to concatenate the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector to generate a concatenated vector;
the decoding network is configured to decode the concatenated vector to obtain the speech spectrum information corresponding to the text to be converted.
In a possible design, the text encoding network includes a character encoding network and a context encoding network;
the character encoding network is configured to perform character-level encoding on the text to be converted, and to generate a character text vector to be summed with the phoneme vector;
the context encoding network is configured to perform character-level encoding on the text to be converted, and to generate a character text vector to be input, together with the voiceprint feature vector, into the hidden prosody vector prediction network.
In a possible design, the acquisition module is further configured to:
acquire training samples, where the training samples include text samples to be converted, corresponding speech samples, and voiceprint feature sample vectors, and the speech samples are speech samples in the 0-2 kHz frequency band;
the apparatus further includes a processing module;
the processing module is configured to train the prosody model using the training samples.
In a possible design, the prosody model further includes a prosody encoding network;
the processing module is specifically configured to:
input the phonemes corresponding to the text sample to be converted into the phoneme encoding network to obtain a corresponding phoneme sample vector, and input the characters of the text sample to be converted into the text encoding network to obtain a corresponding character sample text vector;
input the speech sample, the phoneme sample vector, the character sample text vector, and the voiceprint feature sample vector into the prosody encoding network to obtain a corresponding first hidden prosody sample vector; and
train the prosody model based on the phoneme sample vector, the character sample text vector, the voiceprint feature sample vector, and the first hidden prosody sample vector.
In a possible design, the processing module is specifically configured to:
input the characters of the text to be converted into the character encoding network and the context encoding network respectively, to obtain a corresponding first character sample text vector and a corresponding second character sample text vector;
where inputting the speech sample, the phoneme sample vector, the character sample text vector, and the voiceprint feature sample vector into the prosody encoding network to obtain the corresponding first hidden prosody sample vector includes: inputting the speech sample, the phoneme sample vector, the first character sample text vector, and the voiceprint feature sample vector into the prosody encoding network to obtain the corresponding first hidden prosody sample vector.
In a possible design, the processing module is specifically configured to:
perform feature extraction on the speech sample through a first convolutional layer of the prosody encoding network based on the phoneme sample vector and the voiceprint feature sample vector, to obtain first prosody sample features;
perform character-level pooling on the first prosody sample features through a pooling layer of the prosody encoding network, to obtain character-level prosody sample features;
perform feature extraction on the character-level prosody sample features through a second convolutional layer of the prosody encoding network based on the first character sample text vector and the voiceprint feature sample vector, to obtain second prosody sample features; and
vectorize the second prosody sample features through a vectorization layer of the prosody encoding network, to obtain the first hidden prosody sample vector.
In a possible design, the processing module is specifically configured to:
input the second character sample text vector and the voiceprint feature sample vector into the hidden prosody vector prediction network to predict a second hidden prosody sample vector; and
train the hidden prosody vector prediction network according to the difference between the first hidden prosody sample vector and the second hidden prosody sample vector.
In addition, an embodiment of the present application further provides a data conversion apparatus, including:
an acquisition module, configured to acquire a response to a user instruction sent to a smart device, where the response contains text to be replied for the user instruction;
the acquisition module is further configured to acquire a phoneme vector and a text vector corresponding to the text to be replied and a voiceprint feature vector of a target voice;
the acquisition module is further configured to obtain a linguistic feature vector corresponding to the text to be replied according to the phoneme vector and the text vector;
a prediction module, configured to predict a hidden prosody vector of the text to be replied according to the text vector and the voiceprint feature vector;
a generation module, configured to generate speech spectrum information corresponding to the text to be replied according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector; and
a processing module, configured to generate and play the speech corresponding to the text to be replied according to the speech spectrum information.
In addition, an embodiment of the present application further provides a data conversion apparatus, including:
an acquisition module, configured to acquire live-streaming script text corresponding to an object to be live-streamed;
the acquisition module is further configured to acquire a phoneme vector and a text vector corresponding to the live-streaming script text and a voiceprint feature vector of a target voice;
the acquisition module is further configured to obtain a linguistic feature vector corresponding to the live-streaming script text according to the phoneme vector and the text vector;
a prediction module, configured to predict a hidden prosody vector of the live-streaming script text according to the text vector and the voiceprint feature vector;
a generation module, configured to generate speech spectrum information corresponding to the live-streaming script text according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector; and
a processing module, configured to generate the live-streaming speech corresponding to the live-streaming script text according to the speech spectrum information.
In addition, an embodiment of the present application further provides a data conversion apparatus, including:
an acquisition module, configured to acquire script text to be performed, where the script text to be performed includes one of the following: a line script corresponding to audio or video, or e-book text content;
the acquisition module is further configured to acquire a phoneme vector and a text vector corresponding to the script text and a voiceprint feature vector of a target voice;
the acquisition module is further configured to obtain a linguistic feature vector corresponding to the script text according to the phoneme vector and the text vector;
a prediction module, configured to predict a hidden prosody vector of the script text according to the text vector and the voiceprint feature vector;
a generation module, configured to generate speech spectrum information corresponding to the script text according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector; and
a processing module, configured to generate the performance speech corresponding to the script text according to the speech spectrum information.
In addition, an embodiment of the present application further provides a data conversion apparatus, including:
an acquisition module, configured to acquire a phoneme vector corresponding to text to be converted through the phoneme encoding network of a prosody model, and to acquire a text vector corresponding to the text to be converted through the text encoding network of the prosody model;
a prediction module, configured to predict a hidden prosody vector of the text to be converted through the hidden prosody vector prediction network of the prosody model according to the text vector corresponding to the text to be converted and an acquired voiceprint feature vector of a target voice;
the acquisition module is further configured to sum the phoneme vector and the text vector through the vector concatenation layer of the prosody model, to obtain a linguistic feature vector corresponding to the text to be converted;
a generation module, configured to concatenate the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector to generate a concatenated vector; and
a processing module, configured to decode the concatenated vector through the decoding network of the prosody model, to obtain speech spectrum information corresponding to the text to be converted.
An embodiment of the present application further provides a computer program product, including computer instructions, where the computer instructions instruct a computing device to perform the operations corresponding to any data conversion method in the foregoing method embodiments.
It should be noted that in the embodiments of the present application, the mel spectrogram is taken as an example of the input of the prosody encoding network, but the input is not limited thereto; other acoustic features (such as LPC features, MFCC, fbank, raw waveform, etc.) are equally applicable.
It should be pointed out that, according to implementation needs, each component/step described in the embodiments of the present application may be split into more components/steps, and two or more components/steps, or partial operations of components/steps, may also be combined into new components/steps to achieve the purposes of the embodiments of the present application.
The above methods according to the embodiments of the present application may be implemented in hardware or firmware, or implemented as software or computer code that can be stored in a recording medium (such as a CD-ROM, RAM, floppy disk, hard disk, or magneto-optical disk), or implemented as computer code that is downloaded over a network, originally stored in a remote recording medium or a non-transitory machine-readable medium, and to be stored in a local recording medium, so that the methods described herein can be processed by such software stored on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware (such as an ASIC or FPGA). It can be understood that a computer, a processor, a microprocessor controller, or programmable hardware includes a storage component (for example, RAM, ROM, flash memory, etc.) that can store or receive software or computer code; when the software or computer code is accessed and executed by the computer, processor, or hardware, the data conversion methods described herein are implemented. In addition, when a general-purpose computer accesses code for implementing the data conversion methods shown herein, execution of the code converts the general-purpose computer into a special-purpose computer for performing the data conversion methods shown herein.
Those of ordinary skill in the art may realize that the units and method steps of the examples described in conjunction with the embodiments disclosed herein can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be considered as going beyond the scope of the embodiments of the present application.
The above implementations are only used to illustrate the embodiments of the present application, and are not intended to limit them. Those of ordinary skill in the relevant technical field may also make various changes and modifications without departing from the spirit and scope of the embodiments of the present application; therefore, all equivalent technical solutions also fall within the scope of the embodiments of the present application, and the patent protection scope of the embodiments of the present application shall be defined by the claims.

Claims (17)

1. A data conversion method, comprising:
acquiring a phoneme vector and a text vector corresponding to text to be converted and a voiceprint feature vector of a target voice;
obtaining a linguistic feature vector corresponding to the text to be converted according to the phoneme vector and the text vector, and predicting a hidden prosody vector of the text to be converted according to the text vector and the voiceprint feature vector; and
generating speech spectrum information corresponding to the text to be converted according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector.
2. The method according to claim 1, wherein the text vector is a character text vector corresponding to each character in the text to be converted.
3. The method according to claim 1 or 2, wherein the data conversion method is performed by a prosody model, and the prosody model comprises at least: a phoneme encoding network, a text encoding network, a hidden prosody vector prediction network, a vector concatenation layer, and a decoding network;
the phoneme encoding network is configured to acquire the phoneme vector corresponding to the text to be converted;
the text encoding network is configured to acquire the text vector corresponding to the text to be converted;
the hidden prosody vector prediction network is configured to predict the hidden prosody vector of the text to be converted according to the text vector corresponding to the text to be converted and the acquired voiceprint feature vector of the target voice;
the vector concatenation layer is configured to sum the phoneme vector and the text vector to obtain the linguistic feature vector corresponding to the text to be converted, and to concatenate the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector to generate a concatenated vector; and
the decoding network is configured to decode the concatenated vector to obtain the speech spectrum information corresponding to the text to be converted.
4. The method according to claim 3, wherein the text encoding network comprises a character encoding network and a context encoding network;
the character encoding network is configured to perform character-level encoding on the text to be converted, and to generate a character text vector to be summed with the phoneme vector; and
the context encoding network is configured to perform character-level encoding on the text to be converted, and to generate a character text vector to be input, together with the voiceprint feature vector, into the hidden prosody vector prediction network.
5. The method according to claim 4, wherein the method further comprises:
acquiring training samples, wherein the training samples comprise text samples to be converted, corresponding speech samples, and voiceprint feature sample vectors, and the speech samples are speech samples in the 0-2 kHz frequency band; and
training the prosody model using the training samples.
6. The method according to claim 5, wherein the prosody model further comprises a prosody encoding network;
the training the prosody model using the training samples comprises:
inputting the phonemes corresponding to the text sample to be converted into the phoneme encoding network to obtain a corresponding phoneme sample vector, and inputting the characters of the text sample to be converted into the text encoding network to obtain a corresponding character sample text vector;
inputting the speech sample, the phoneme sample vector, the character sample text vector, and the voiceprint feature sample vector into the prosody encoding network to obtain a corresponding first hidden prosody sample vector; and
training the prosody model based on the phoneme sample vector, the character sample text vector, the voiceprint feature sample vector, and the first hidden prosody sample vector.
7. The method according to claim 6, wherein the inputting the characters of the text sample to be converted into the text encoding network to obtain the corresponding character sample text vector comprises:
inputting the characters of the text to be converted into the character encoding network and the context encoding network respectively, to obtain a corresponding first character sample text vector and a corresponding second character sample text vector; and
the inputting the speech sample, the phoneme sample vector, the character sample text vector, and the voiceprint feature sample vector into the prosody encoding network to obtain the corresponding first hidden prosody sample vector comprises: inputting the speech sample, the phoneme sample vector, the first character sample text vector, and the voiceprint feature sample vector into the prosody encoding network to obtain the corresponding first hidden prosody sample vector.
8. The method according to claim 7, wherein the inputting the speech sample, the phoneme sample vector, the first character sample text vector, and the voiceprint feature sample vector into the prosody encoding network to obtain the corresponding first hidden prosody sample vector comprises:
performing feature extraction on the speech sample through a first convolutional layer of the prosody encoding network based on the phoneme sample vector and the voiceprint feature sample vector, to obtain first prosody sample features;
performing character-level pooling on the first prosody sample features through a pooling layer of the prosody encoding network, to obtain character-level prosody sample features;
performing feature extraction on the character-level prosody sample features through a second convolutional layer of the prosody encoding network based on the first character sample text vector and the voiceprint feature sample vector, to obtain second prosody sample features; and
vectorizing the second prosody sample features through a vectorization layer of the prosody encoding network, to obtain the first hidden prosody sample vector.
9. The method according to claim 7, wherein the training the prosody model based on the phoneme sample vector, the character sample text vector, the voiceprint feature sample vector, and the first hidden prosody sample vector comprises:
inputting the second character sample text vector and the voiceprint feature sample vector into the hidden prosody vector prediction network, to predict a second hidden prosody sample vector; and
training the hidden prosody vector prediction network according to the difference between the first hidden prosody sample vector and the second hidden prosody sample vector.
10. A data conversion method, comprising:
acquiring a response to a user instruction sent to a smart device, wherein the response contains text to be replied for the user instruction;
acquiring a phoneme vector and a text vector corresponding to the text to be replied and a voiceprint feature vector of a target voice;
obtaining a linguistic feature vector corresponding to the text to be replied according to the phoneme vector and the text vector, and predicting a hidden prosody vector of the text to be replied according to the text vector and the voiceprint feature vector;
generating speech spectrum information corresponding to the text to be replied according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector; and
generating and playing the speech corresponding to the text to be replied according to the speech spectrum information.
11. A data conversion method, comprising:
acquiring live-streaming script text corresponding to an object to be live-streamed;
acquiring a phoneme vector and a text vector corresponding to the live-streaming script text and a voiceprint feature vector of a target voice;
obtaining a linguistic feature vector corresponding to the live-streaming script text according to the phoneme vector and the text vector, and predicting a hidden prosody vector of the live-streaming script text according to the text vector and the voiceprint feature vector;
generating speech spectrum information corresponding to the live-streaming script text according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector; and
generating live-streaming speech corresponding to the live-streaming script text according to the speech spectrum information.
12. A data conversion method, comprising:
acquiring script text to be performed, wherein the script text to be performed comprises one of the following: a line script corresponding to audio or video, or e-book text content;
acquiring a phoneme vector and a text vector corresponding to the script text and a voiceprint feature vector of a target voice;
obtaining a linguistic feature vector corresponding to the script text according to the phoneme vector and the text vector, and predicting a hidden prosody vector of the script text according to the text vector and the voiceprint feature vector;
generating speech spectrum information corresponding to the script text according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector; and
generating performance speech corresponding to the script text according to the speech spectrum information.
13. A data conversion method, comprising:
acquiring a phoneme vector corresponding to text to be converted through a phoneme encoding network of a prosody model, and acquiring a text vector corresponding to the text to be converted through a text encoding network of the prosody model;
predicting a hidden prosody vector of the text to be converted through a hidden prosody vector prediction network of the prosody model according to the text vector corresponding to the text to be converted and an acquired voiceprint feature vector of a target voice;
summing the phoneme vector and the text vector through a vector concatenation layer of the prosody model to obtain a linguistic feature vector corresponding to the text to be converted, and concatenating the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector to generate a concatenated vector; and
decoding the concatenated vector through a decoding network of the prosody model to obtain speech spectrum information corresponding to the text to be converted.
14. A data conversion apparatus, comprising:
an acquisition module, configured to acquire a phoneme vector and a text vector corresponding to text to be converted and a voiceprint feature vector of a target voice;
the acquisition module being further configured to obtain a linguistic feature vector corresponding to the text to be converted according to the phoneme vector and the text vector;
a prediction module, configured to predict a hidden prosody vector of the text to be converted according to the text vector and the voiceprint feature vector; and
a generation module, configured to generate speech spectrum information corresponding to the text to be converted according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector.
15. An electronic device, comprising:
a memory, configured to store a program; and
a processor, configured to execute the program stored in the memory, wherein when the program is executed, the processor is configured to perform the data conversion method according to any one of claims 1 to 13.
16. A computer storage medium having a computer program stored thereon, wherein when the program is executed by a processor, the data conversion method according to any one of claims 1 to 13 is implemented.
17. A computer program product, comprising a computer program, wherein when the computer program is executed by a processor, the data conversion method according to any one of claims 1 to 13 is implemented.
PCT/CN2022/130735 2021-12-20 2022-11-08 Data conversion method and computer storage medium WO2023116243A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111559250.5 2021-12-20
CN202111559250.5A CN113948062B (en) 2021-12-20 2021-12-20 Data conversion method and computer storage medium

Publications (1)

Publication Number Publication Date
WO2023116243A1 true WO2023116243A1 (en) 2023-06-29

Family

ID=79339324

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/130735 WO2023116243A1 (en) 2021-12-20 2022-11-08 Data conversion method and computer storage medium

Country Status (2)

Country Link
CN (1) CN113948062B (en)
WO (1) WO2023116243A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113948062B (en) * 2021-12-20 2022-08-16 阿里巴巴达摩院(杭州)科技有限公司 Data conversion method and computer storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080195391A1 (en) * 2005-03-28 2008-08-14 Lessac Technologies, Inc. Hybrid Speech Synthesizer, Method and Use
CN103117057A (en) * 2012-12-27 2013-05-22 安徽科大讯飞信息科技股份有限公司 Application method of special human voice synthesis technique in mobile phone cartoon dubbing
CN111161705A (en) * 2019-12-19 2020-05-15 上海寒武纪信息科技有限公司 Voice conversion method and device
CN111754976A (en) * 2020-07-21 2020-10-09 中国科学院声学研究所 Rhythm control voice synthesis method, system and electronic device
CN112086086A (en) * 2020-10-22 2020-12-15 平安科技(深圳)有限公司 Speech synthesis method, device, equipment and computer readable storage medium
CN112669841A (en) * 2020-12-18 2021-04-16 平安科技(深圳)有限公司 Training method and device for multilingual speech generation model and computer equipment
CN112750419A (en) * 2020-12-31 2021-05-04 科大讯飞股份有限公司 Voice synthesis method and device, electronic equipment and storage medium
CN113763920A (en) * 2020-05-29 2021-12-07 广东美的制冷设备有限公司 Air conditioner, voice generation method thereof, voice generation device and readable storage medium
CN113808571A (en) * 2021-08-17 2021-12-17 北京百度网讯科技有限公司 Speech synthesis method, speech synthesis device, electronic device and storage medium
CN113948062A (en) * 2021-12-20 2022-01-18 阿里巴巴达摩院(杭州)科技有限公司 Data conversion method and computer storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108597492B (en) * 2018-05-02 2019-11-26 百度在线网络技术(北京)有限公司 Phoneme synthesizing method and device
CN109285535A (en) * 2018-10-11 2019-01-29 四川长虹电器股份有限公司 Phoneme synthesizing method based on Front-end Design
CN110534089B (en) * 2019-07-10 2022-04-22 西安交通大学 Chinese speech synthesis method based on phoneme and prosodic structure
CN112786004A (en) * 2020-12-30 2021-05-11 科大讯飞股份有限公司 Speech synthesis method, electronic device, and storage device
CN113257221B (en) * 2021-07-06 2021-09-17 成都启英泰伦科技有限公司 Voice model training method based on front-end design and voice synthesis method
CN113506562B (en) * 2021-07-19 2022-07-19 武汉理工大学 End-to-end voice synthesis method and system based on fusion of acoustic features and text emotional features

Also Published As

Publication number Publication date
CN113948062B (en) 2022-08-16
CN113948062A (en) 2022-01-18

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 22909560
Country of ref document: EP
Kind code of ref document: A1