WO2023116243A1 - Data conversion method and computer storage medium

Data conversion method and computer storage medium

Info

Publication number
WO2023116243A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
text
sample
prosodic
hidden
Application number
PCT/CN2022/130735
Other languages
French (fr)
Chinese (zh)
Inventor
任意
雷鸣
黄智颖
张仕良
陈谦
鄢志杰
Original Assignee
Alibaba Damo Academy (Hangzhou) Technology Co., Ltd. (阿里巴巴达摩院(杭州)科技有限公司)
Application filed by Alibaba Damo Academy (Hangzhou) Technology Co., Ltd. (阿里巴巴达摩院(杭州)科技有限公司)
Publication of WO2023116243A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Definitions

  • the embodiments of the present application relate to the field of computer technologies, and in particular, to a data conversion method and a computer storage medium.
  • Speech synthesis technology, also known as Text-to-Speech (TTS) technology, can convert text information into standard, fluent speech; it is equivalent to installing an artificial mouth on a machine.
  • In some scenarios, highly expressive speech synthesis is required.
  • This kind of speech synthesis needs to model prosody, and the expressiveness of the synthesized speech is improved through the prosody model.
  • Prosodic components include fundamental frequency, energy, and duration.
  • Existing prosody modeling is usually constructed based on the fundamental frequency features of the prosody. On the one hand, because fundamental frequency extraction is inaccurate, the prosody modeling effect is poor, which in turn leads to inaccurate prosody information; on the other hand, failure to take into account the correlation between the factors affecting prosody also results in poor prosody modeling and inaccurate prosody information.
  • Therefore, the prosody information obtained based on the current prosody modeling method has the problem of poor accuracy.
  • an embodiment of the present application provides a data conversion solution to at least partially solve the above problem.
  • According to a first aspect of the embodiments of the present application, a data conversion method is provided.
  • According to a second aspect of the embodiments of the present application, a data conversion method is provided in which a response to a user instruction is acquired, the response including text to be replied to the user instruction.
  • According to a third aspect of the embodiments of the present application, a data conversion method is provided for converting the live script text corresponding to an object to be broadcast live into live speech.
  • According to a fourth aspect of the embodiments of the present application, a data conversion method is provided in which the script text to be played includes one of the following: a line script corresponding to audio or video, or the text content of an e-book.
  • According to a fifth aspect of the embodiments of the present application, a data conversion method is provided in which a splicing vector is decoded by the decoding network of the prosody model to obtain the speech spectrum information corresponding to the text to be converted.
  • a data conversion device including:
  • an obtaining module, used to obtain the phoneme vector and text vector corresponding to the text to be converted, and the voiceprint feature vector of the target human voice;
  • the obtaining module is also used to obtain a linguistic feature vector corresponding to the text to be converted according to the phoneme vector and the text vector;
  • a prediction module configured to predict and obtain the hidden prosody vector of the text to be converted according to the text vector and the voiceprint feature vector;
  • a generating module configured to generate speech spectrum information corresponding to the text to be converted according to the linguistic feature vector, the hidden prosodic vector, and the voiceprint feature vector.
  • a data conversion device including:
  • An acquisition module configured to acquire a response to a user instruction sent to the smart device, where the response contains text to be replied to the user instruction;
  • the obtaining module is also used to obtain the phoneme vector, text vector and voiceprint feature vector of the target human voice corresponding to the text to be replied;
  • the obtaining module is also used to obtain a linguistic feature vector corresponding to the text to be replied according to the phoneme vector and the text vector;
  • a prediction module configured to predict and obtain the hidden prosody vector of the text to be replied according to the text vector and the voiceprint feature vector;
  • a generation module configured to generate speech spectrum information corresponding to the text to be replied according to the linguistic feature vector, the hidden prosody vector and the voiceprint feature vector;
  • a processing module configured to generate and play the voice corresponding to the text to be replied according to the voice spectrum information.
  • a data conversion device including:
  • An acquisition module configured to acquire the live script text corresponding to the object to be broadcasted
  • the acquiring module is also used to acquire the phoneme vector, text vector and voiceprint feature vector of the target human voice corresponding to the live script text;
  • the obtaining module is also used to obtain a linguistic feature vector corresponding to the live script text according to the phoneme vector and the text vector;
  • a prediction module configured to predict and obtain the hidden prosody vector of the live script text according to the text vector and the voiceprint feature vector;
  • a generating module configured to generate voice spectrum information corresponding to the live script text according to the linguistic feature vector, the hidden prosody vector and the voiceprint feature vector;
  • a processing module configured to generate live speech corresponding to the live script text according to the speech spectrum information.
  • a data conversion device including:
  • the obtaining module is used to obtain the script text to be played, wherein the script text to be played includes one of the following: a line script corresponding to audio or video, or the text content of an e-book;
  • the acquisition module is also used to acquire the phoneme vector, text vector and voiceprint feature vector of the target human voice corresponding to the script text;
  • the obtaining module is also used to obtain a linguistic feature vector corresponding to the script text according to the phoneme vector and the text vector;
  • a prediction module configured to predict and obtain the hidden prosody vector of the script text according to the text vector and the voiceprint feature vector
  • a generating module configured to generate speech spectrum information corresponding to the script text according to the linguistic feature vector, the hidden prosody vector and the voiceprint feature vector;
  • a processing module configured to generate a performance voice corresponding to the script text according to the voice spectrum information.
  • a data conversion device including:
  • An acquisition module configured to acquire a phoneme vector corresponding to the text to be converted through the phoneme encoding network of the prosodic model; and acquire a text vector corresponding to the text to be converted through the text encoding network of the prosodic model;
  • the prediction module is used to predict and obtain the hidden prosodic vector of the text to be converted through the hidden prosodic vector prediction network of the prosody model according to the text vector corresponding to the text to be converted and the acquired voiceprint feature vector of the target human voice;
  • the acquisition module is also used to add the phoneme vector and the text vector through the vector splicing layer of the prosody model to obtain the linguistic feature vector corresponding to the text to be converted;
  • a generating module configured to splice the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector to generate a splicing vector;
  • a processing module configured to decode the concatenation vector through the decoding network of the prosody model to obtain speech spectrum information corresponding to the text to be converted.
  • an electronic device including:
  • a processor configured to execute the program stored in the memory, and when the program is executed, the processor is configured to execute the data conversion method according to any one of the first aspect to the fifth aspect.
  • a computer storage medium on which a computer program is stored; when the program is executed by a processor, the data conversion method described in any one of the first aspect to the fifth aspect is implemented.
  • a computer program product, including a computer program; when the computer program is executed by a processor, it implements the data conversion method described in any one of the first to fifth aspects above.
  • With the solution provided by the embodiments of the present application, the phonemes of the text to be converted, the text itself, and the voiceprint features of the target human voice are all taken into consideration.
  • The linguistic features of the text to be converted can be obtained based on the phonemes and the text, and these carry the pronunciation features of the corresponding level of the text (such as character level, word level, sentence level, etc.); based on the text and the voiceprint features, the hidden prosodic vector of the text to be converted, which mainly contains prosody information, can be predicted. Prosody obtained in this way is derived from the features corresponding to the text, and pays more attention to the characteristics of the prosody itself.
  • Therefore, the speech spectrum information obtained after processing based on the linguistic features, the hidden prosodic vectors and the voiceprint features better matches the speech characteristics of the target human voice corresponding to the actual voiceprint features, and is closer to the actual prosody of the target human voice. As a result, the speech subsequently generated based on the obtained speech spectrum information is also closer to the actual human voice.
  • On the one hand, the prosody modeling is no longer based on the fundamental frequency; instead, the prosody information is extracted based on various kinds of information related to the prosody, which makes the extracted prosody more accurate. On the other hand, the comprehensive consideration of the relationship among the various factors affecting the prosody also makes the prosody thus obtained more accurate.
  • FIG. 1 is a schematic diagram of an exemplary system applicable to the data conversion method of the embodiment of the present application
  • FIG. 2A is a flowchart of steps of a data conversion method according to Embodiment 1 of the present application.
  • Fig. 2B is a schematic diagram of a model example in the embodiment shown in Fig. 2A;
  • FIG. 2C is a schematic diagram of a scenario example in the embodiment shown in FIG. 2A;
  • FIG. 3A is a flowchart of steps of a data conversion method according to Embodiment 2 of the present application.
  • Fig. 3B is a schematic diagram of a model and an example of its training process in the embodiment shown in Fig. 3A;
  • FIG. 4 is a schematic structural diagram of an electronic device according to Embodiment 3 of the present application.
  • Fig. 1 shows an exemplary system applicable to the data conversion method of the embodiment of the present application.
  • the system 100 may include a server 102, a communication network 104, and/or one or more user devices 106, exemplified in FIG. 1 as a plurality of user devices.
  • Server 102 may be any suitable server for storing information, data, programs, and/or any other suitable type of content.
  • server 102 may perform any suitable function.
  • the server 102 may be used to determine the speech spectrum information to be used in the speech synthesis process.
  • the server 102 may be used to determine the corresponding speech spectrum information based on the text to be converted, and then perform speech synthesis based on the speech spectrum information.
  • the server 102 may determine the corresponding voice spectrum information based on the phoneme corresponding to the text to be converted, the text, and the voiceprint of the target human voice.
  • communication network 104 may be any suitable combination of one or more wired and/or wireless networks.
  • communication network 104 can include any one or more of the following: the Internet, an intranet, a Wide Area Network (WAN), a Local Area Network (LAN), a wireless network, a Digital Subscriber Line (DSL) network, a frame relay network, an Asynchronous Transfer Mode (ATM) network, a Virtual Private Network (VPN) and/or any other suitable communication network.
  • User device 106 can be connected to communication network 104 via one or more communication links (e.g., communication link 112), through which it can be linked to server 102.
  • the communication link may be any communication link suitable for transferring data between the user device 106 and the server 102, such as a network link, a dial-up link, a wireless link, a hardwired link, any other suitable communication link or any suitable combination of such links.
  • the user equipment 106 may include any one or more user equipment suitable for presenting an interface for information input and output, and for playing voice.
  • user equipment 106 may comprise any suitable type of equipment.
  • user devices 106 may include IoT devices, mobile devices, tablet computers, laptop computers, desktop computers, wearable computers, game consoles, media players, vehicle entertainment systems, and/or any other appropriate type of user equipment. Note that, in some embodiments, if the user equipment 106 has sufficiently high software and hardware performance, it can also take over the function of the server 102.
  • Although server 102 is illustrated as one device, in some embodiments any suitable number of devices may be used to perform the functions performed by server 102. For example, in some embodiments, multiple devices may be used to implement the functions performed by server 102. Alternatively, the functions of the server 102 may be implemented using cloud services.
  • an embodiment of the present application provides a data conversion method, which will be described below through multiple embodiments.
  • Referring to FIG. 2A, it shows a flowchart of the steps of a data conversion method according to Embodiment 1 of the present application.
  • Step S202 Obtain the phoneme vector, text vector and voiceprint feature vector of the target human voice corresponding to the text to be converted.
  • A phoneme is the smallest unit of speech divided according to the natural properties of speech. It is analyzed according to the pronunciation actions in a syllable, and one action constitutes one phoneme. For example, the syllable "a" has only one phoneme, while "ge" has two phonemes, and so on.
  • phonemes are an important consideration and conversion basis in the process of converting text to speech. In the specific conversion process, it is necessary to determine what kind of human voice to convert the text into. Therefore, the voiceprint feature needs to be used as a reference to finally generate a voice similar to the target human voice.
  • The text vector of the text to be converted is also used in the embodiments of the present application.
  • Text vectors can be at different levels, such as phoneme level, character level, word level, clause level, sentence level, etc.
  • Text vectors are highly correlated with the other vectors used to generate prosody, such as the phoneme vectors and the voiceprint feature vectors.
  • Text vectors can provide richer reference information for the subsequent generation of prosody-related vectors, including but not limited to textual information and/or semantic information.
  • In the embodiments of the present application, the text vector can be at the character level, which, among other benefits, gives a better correspondence with the phoneme vector.
  • The specific method of generating the corresponding phoneme vector and text vector based on the text to be converted, as well as the method of obtaining the voiceprint feature vector of the target human voice, can be chosen by those skilled in the art according to the actual situation, such as a neural network model or an algorithm, which is not limited in this embodiment of the present application.
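  • As a non-limiting illustration only, the acquisition in step S202 could be sketched as follows. The encoder classes, dimensions and helper names here are hypothetical assumptions made for readability, not structures required by the embodiments of the present application:

```python
import torch
import torch.nn as nn

class PhonemeEncoder(nn.Module):
    """Hypothetical phoneme encoder: maps a phoneme id sequence to phoneme vectors."""
    def __init__(self, num_phonemes: int, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(num_phonemes, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, phoneme_ids: torch.Tensor) -> torch.Tensor:
        # phoneme_ids: [T_phoneme] -> phoneme vector sequence [T_phoneme, dim]
        return self.proj(self.embed(phoneme_ids))

class WordEncoder(nn.Module):
    """Hypothetical character-level text encoder producing the character text vector."""
    def __init__(self, num_chars: int, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(num_chars, dim)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: [T_char] -> character text vector sequence [T_char, dim]
        return self.embed(char_ids)

# The voiceprint feature vector of the target human voice (H_spk) is assumed to be
# extracted in advance by any suitable speaker-embedding method; the embodiments
# do not limit how it is obtained.
```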
  • Step S204 Obtain the linguistic feature vector corresponding to the text to be converted according to the phoneme vector and the text vector; predict and obtain the hidden prosody vector of the text to be converted according to the text vector and the voiceprint feature vector.
  • On the one hand, the text vector and the phoneme vector will be combined to generate a linguistic feature vector carrying prosodic information and semantic information; on the other hand, the text vector will be combined with the voiceprint feature vector to predict hidden prosodic vectors that mainly carry prosodic information related to the text.
  • the text vectors used by the two aspects can be obtained in different ways.
  • For example, the text vector to be combined with the phoneme vector can be obtained through a character encoding network (also called a character encoder), and the text vector to be combined with the voiceprint feature vector can be obtained through a BERT model, whose full name is Bidirectional Encoder Representations from Transformers.
  • Step S206 Generate speech spectrum information corresponding to the text to be converted according to the linguistic feature vector, hidden prosody vector and voiceprint feature vector.
  • The prosodic information includes, but is not limited to, intonation, speech rate, energy, spatial information, and the like.
  • the linguistic feature vector, the hidden prosody vector and the voiceprint feature vector may be concatenated to generate a concatenated vector; the concatenated vector may be decoded to obtain speech spectrum information corresponding to the text to be converted. Because the spliced vectors carry rich information associated with prosody in the text to be converted, the speech spectrum information obtained by decoding based on the spliced vectors is also more accurate.
  • the above process can be realized by a neural network model, which is called a prosody model in this application, and an exemplary prosody model is shown in FIG. 2B .
  • The prosody model includes: a phoneme encoding network (shown as the Phoneme Encoder in the figure), a text encoding network (shown as the character-level Word Encoder in the figure), a hidden prosody vector prediction network (shown as the LPV Predictor in the figure), a vector splicing layer (shown in the figure as the dashed box where the "+" sign is located), and a decoding network (shown in the figure as the dashed box where the Decoder is located).
  • The phoneme encoding network is used to obtain the phoneme vector corresponding to the text to be converted; the text encoding network is used to obtain the text vector corresponding to the text to be converted; the hidden prosody vector prediction network is used to predict the hidden prosody vector of the text to be converted according to the text vector corresponding to the text to be converted and the voiceprint feature vector of the target human voice;
  • the vector splicing layer is used to add the phoneme vector and the text vector to obtain the linguistic feature vector corresponding to the text to be converted; and, the linguistic feature vector , the hidden prosody vector and the voiceprint feature vector are spliced to generate a spliced vector;
  • the decoding network is used to decode the spliced vector to obtain the speech spectrum information corresponding to the text to be converted.
  • Based on this, the solution of the embodiment of the present application can be implemented as follows: the phoneme vector corresponding to the text to be converted is obtained through the phoneme encoding network of the prosody model, and the text vector corresponding to the text to be converted is obtained through the text encoding network of the prosody model; through the hidden prosody vector prediction network of the prosody model, the hidden prosody vector of the text to be converted is predicted according to the text vector corresponding to the text to be converted and the acquired voiceprint feature vector of the target human voice; through the vector splicing layer of the prosody model, the phoneme vector and the text vector are added to obtain the linguistic feature vector corresponding to the text to be converted, and the linguistic feature vector, hidden prosody vector and voiceprint feature vector are spliced to generate a splicing vector; and through the decoding network of the prosody model, the splicing vector is decoded to obtain the speech spectrum information corresponding to the text to be converted.
  • the decoding network part of the prosody model in this example is also provided with a Length Regulator and a Linear Layer.
  • Length Regulator is used to adjust the lengths of linguistic feature vectors, hidden prosody vectors and voiceprint feature vectors so that their lengths are consistent with the voice spectrum information.
  • the Linear Layer is used to linearize the output of the Decoder.
  • Text encoding network includes character encoding network and context encoding network.
  • the character encoding network as shown in the Word Encoder in the figure, is used to encode the text to be converted at the character level, and generate a character text vector for summing up with the phoneme vector.
  • the context encoding network can be such as BERT network or other network that can generate text vectors, which is used to encode the text to be converted at the character level, and generate character text vectors for inputting into the hidden prosody vector prediction network together with the voiceprint feature vector.
  • the two encoding networks may also adopt the same structure, which is also applicable to the solution of the embodiment of the present application.
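  • To make the data flow of the prosody model concrete, the following is a minimal sketch of the inference path described above, assuming a PyTorch-style interface. The function signatures, the expand_to_phoneme_level helper and the tensor shapes are illustrative assumptions; only the overall flow (encoding, LPV prediction, addition, splicing, length regulation and decoding) follows the description:

```python
import torch

def prosody_model_infer(phoneme_ids, char_ids, h_spk,
                        phoneme_encoder, word_encoder, context_encoder,
                        lpv_predictor, length_regulator, decoder, linear_layer,
                        expand_to_phoneme_level):
    # h_spk: voiceprint feature vector of the target human voice, assumed shape [D].
    # Phoneme Encoder / character-level Word Encoder.
    phoneme_emb = phoneme_encoder(phoneme_ids)                 # [T_phoneme, D]
    word_emb = word_encoder(char_ids)                          # [T_char, D]

    # Linguistic feature vector: phoneme vector plus character text vector
    # (character vectors are assumed here to be expanded to phoneme length first).
    h_ling = phoneme_emb + expand_to_phoneme_level(word_emb, phoneme_ids)

    # Hidden prosody vector (LPV) predicted from the context-encoded text and H_spk.
    context_emb = context_encoder(char_ids)                    # e.g. a BERT-style encoder
    lpv = lpv_predictor(context_emb, h_spk)                    # [T_char, D]

    # Splice H_ling, LPV and H_spk, regulate length to frame level, then decode.
    spliced = torch.cat([h_ling,
                         expand_to_phoneme_level(lpv, phoneme_ids),
                         h_spk.expand(h_ling.size(0), -1)], dim=-1)
    frames = length_regulator(spliced)                          # align with spectrum length
    return linear_layer(decoder(frames))                        # predicted mel-spectrogram
```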
  • the speech synthesis process usually includes three parts: front-end processing, acoustic model processing, and vocoder processing.
  • The front-end processing is mainly to obtain pronunciation and linguistic information from the text to be converted, including but not limited to: text normalization (text standardization), grapheme-to-phoneme conversion (such as converting text characters into phonemes and other pronunciation information, so that the subsequent acoustic model can accurately obtain the pronunciation of the text characters), and so on.
  • The acoustic model processing part is mainly completed by the acoustic model. In the embodiment of the present application, the acoustic model is implemented by the above-mentioned prosody model.
  • the prosody model generates acoustic features based on the pronunciation information or linguistic information generated by the front-end processing, such as the Mel spectrogram.
  • the prosody model outputs a mel-spectrogram based on the phonemes of the text to be converted, the text at the character level, and the voiceprint features of the target human voice to be converted. The process is as described above, and will not be repeated here.
  • the mel-spectrogram output by the prosody model will be input into the vocoder, and the vocoder will synthesize the final sound waveform based on the mel-spectrogram.
  • the TTS conversion process from text to speech is completed.
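  • A compact sketch of this three-stage text-to-speech flow is given below; all helper names (the front-end functions, the prosody model object and the vocoder) are assumed for illustration, and any components fulfilling the described roles could be substituted:

```python
def text_to_speech(text, h_spk, prosody_model, vocoder,
                   normalize_text, grapheme_to_phoneme, tokenize_characters):
    # Front-end processing: text normalization and grapheme-to-phoneme conversion.
    norm_text = normalize_text(text)
    phoneme_ids = grapheme_to_phoneme(norm_text)
    char_ids = tokenize_characters(norm_text)

    # Acoustic model processing: the prosody model outputs a mel-spectrogram based on
    # the phonemes, the character-level text and the voiceprint features (h_spk).
    mel = prosody_model(phoneme_ids, char_ids, h_spk)

    # Vocoder processing: synthesize the final sound waveform from the mel-spectrogram.
    return vocoder(mel)
```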
  • In one application scenario, the speech synthesis process includes: obtaining a response to a user instruction sent to the smart device, the response containing the text to be replied to the user instruction; obtaining the phoneme vector and text vector corresponding to the text to be replied and the voiceprint feature vector of the target human voice; obtaining the linguistic feature vector corresponding to the text to be replied according to the phoneme vector and the text vector; predicting the hidden prosody vector of the text to be replied according to the text vector and the voiceprint feature vector; generating the speech spectrum information corresponding to the text to be replied according to the linguistic feature vector, the hidden prosody vector and the voiceprint feature vector; and generating and playing the speech corresponding to the text to be replied according to the speech spectrum information.
  • For example, the smart device is a smart speaker, the user instruction is a voice question issued by the user, and the text to be replied corresponds to the reply to the voice question.
  • Suppose user X asks the smart speaker a voice question: "What is 1 plus 1 equal to?"
  • The smart speaker converts the question into text and sends it to the server for query, and obtains the query result returned by the server: "1 plus 1 equals 2".
  • The smart speaker then converts each character in the query result into a phoneme to form a phoneme sequence; in addition, the smart speaker has its own voiceprint features.
  • The smart speaker uses the corresponding phonemes in the phoneme sequence, the characters, and the voiceprint features as the input of the prosody model in character order, and a Mel spectrogram is output through the above-mentioned processing of the prosody model; the Mel spectrogram is then input into the vocoder, and the final speech is synthesized and played by the vocoder. In this way, the reply to the voice question of user X is realized.
  • The prosody model and the vocoder are shown separately above, but those skilled in the art should understand that, in practical applications, the prosody model and the vocoder may both be set in the smart speaker, and their execution is controlled by corresponding components in the smart speaker, such as a processor.
  • the speech synthesis process may include: obtaining the live script text corresponding to the object to be broadcast live; obtaining the phoneme vector, text vector, and voiceprint feature vector of the target human voice corresponding to the live script text; according to the phoneme vector and Text vector, to obtain the linguistic feature vector corresponding to the live script text; predict and obtain the hidden prosodic vector of the live script text according to the text vector and voiceprint feature vector; generate the live broadcast according to the linguistic feature vector, hidden prosodic vector and voiceprint feature vector Voice spectrum information corresponding to the script text; generating live voice corresponding to the live script text according to the voice spectrum information.
  • The live script corresponding to the object to be broadcast live can be a live script corresponding to multiple live broadcast objects (such as commodities, content or programs, etc.), for example the script of the whole live broadcast, or the live script corresponding to one or some of the multiple live broadcast objects.
  • the above-mentioned method can be used to finally convert the live broadcast script into live voice, so as to be applied to live broadcast scenarios, such as live broadcast delivery or live content promotion, and so on.
  • the live broadcast voice can be adapted to a virtual anchor or a real anchor, and can be widely used in live broadcast scenarios.
  • the speech synthesis process may include: obtaining the script text to be broadcast; obtaining the phoneme vector, text vector, and voiceprint feature vector of the target vocal corresponding to the script text; obtaining the script according to the phoneme vector and the text vector The linguistic feature vector corresponding to the text; predict and obtain the hidden prosodic vector of the script text according to the text vector and the voiceprint feature vector; generate the voice spectrum information corresponding to the script text according to the linguistic feature vector, hidden prosodic vector and voiceprint feature vector; According to the voice spectrum information, the performance voice corresponding to the script text is generated.
  • the script text to be performed includes one of the following: a line script corresponding to audio or video, or the text content of an e-book.
  • the above-mentioned method can be used to finally convert the script text into a performance voice, so as to be applied to the performance scene.
  • For example, the performance voice can be used to dub video characters, to realize audio generation, or to realize audio e-books, and so on.
  • It can be seen that, through the solution of this embodiment, the phonemes of the text to be converted, the text itself, and the voiceprint features of the target human voice are all taken into consideration.
  • The linguistic features of the text to be converted can be obtained based on the phonemes and the text, and these carry the pronunciation features of the corresponding level of the text (such as character level, word level, sentence level, etc.); based on the text and the voiceprint features, the hidden prosodic vector of the text to be converted, which mainly contains prosody information, can be predicted. Prosody obtained in this way is derived from the features corresponding to the text, and pays more attention to the characteristics of the prosody itself.
  • Therefore, the speech spectrum information obtained after processing based on the linguistic features, the hidden prosodic vectors and the voiceprint features better matches the speech characteristics of the target human voice corresponding to the actual voiceprint features, and is closer to the actual prosody of the target human voice. As a result, the speech subsequently generated based on the obtained speech spectrum information is also closer to the actual human voice.
  • On the one hand, the prosody modeling is no longer based on the fundamental frequency; instead, the prosody information is extracted based on various kinds of information related to the prosody, which makes the extracted prosody more accurate. On the other hand, the comprehensive consideration of the relationship among the various factors affecting the prosody also makes the prosody thus obtained more accurate.
  • Referring to FIG. 3A, it shows a flowchart of the steps of a data conversion method according to Embodiment 2 of the present application.
  • In this embodiment, data conversion using a prosody model is taken as an example; the training process of the prosody model is first introduced, and then data conversion is performed based on the trained prosody model.
  • Step S302 Obtain training samples, and use the training samples to train the prosodic model.
  • The training sample includes a text sample to be converted, a corresponding voice sample, and a voiceprint feature sample vector.
  • In this embodiment, the voice sample is a low-frequency voice sample, such as a voice sample with a frequency band of 0-2 kHz (kilohertz).
  • On the one hand, the low-frequency speech samples carry sufficient prosody-related information, so using them will not affect the training effect; on the other hand, removing the speech in frequency bands other than the low-frequency band can make the model structure simpler.
  • the full-band voice samples are also applicable to the solutions of the embodiments of the present application.
  • In addition, low-quality speech samples containing noise can also be used, and the training is no longer limited to high-quality speech samples. In this way, audio in video, conventional audio, broadcast audio, etc. can all be used as speech samples in the embodiments of the present application, which greatly enriches the number and selection range of speech samples and reduces the acquisition cost of speech samples.
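  • For example, the low-frequency portion of a speech sample could be represented by restricting the mel filterbank to the 0-2 kHz band. The sketch below uses librosa and is only one possible way to prepare such training features; the number of mel bins is an arbitrary assumption:

```python
import librosa

def low_band_mel(wav_path: str, n_mels: int = 20, fmax: float = 2000.0):
    """Log-mel spectrogram restricted to the 0-2 kHz band of a (possibly noisy) voice sample."""
    y, sr = librosa.load(wav_path, sr=None)          # keep the file's native sample rate
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                         fmin=0.0, fmax=fmax)
    return librosa.power_to_db(mel)                  # features for the prosody encoder
```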
  • In this embodiment, the prosody model is as shown in FIG. 3B, and includes: a phoneme encoding network (shown as the Phoneme Encoder in the figure), a text encoding network, a prosody encoding network (shown as the Prosody Encoder in the figure), a hidden prosody vector prediction network (shown as the LPV Predictor in the figure), a vector splicing layer (shown in the figure as the dashed box where the "+" sign is located), and a decoding network (shown in the figure as the dashed box where the Decoder is located).
  • The training of the prosody model includes: inputting the phonemes corresponding to the text sample to be converted into the phoneme encoding network to obtain the corresponding phoneme sample vector; inputting the characters of the text sample to be converted into the text encoding network to obtain the corresponding character sample text vector; inputting the speech sample, the phoneme sample vector, the character sample text vector and the voiceprint feature sample vector into the prosody encoding network to obtain the corresponding first hidden prosody sample vector (the LPV shown in the prosody model in FIG. 3B); and training the prosody model based on the phoneme sample vector, the character sample text vector, the voiceprint feature sample vector and the first hidden prosody sample vector.
  • the text encoding network is divided into a character encoding network (shown as a character-level Word Encoder) and a context encoding network (shown as a Context Encoder in the upper right corner).
  • Correspondingly, inputting the speech sample, the phoneme sample vector, the character sample text vector and the voiceprint feature sample vector into the prosody encoding network to obtain the corresponding first hidden prosody sample vector can be realized as follows: the speech sample, the phoneme sample vector, the first character sample text vector and the voiceprint feature sample vector are input into the prosody encoding network to obtain the corresponding first hidden prosody sample vector.
  • the decoding network part is also provided with a Length Regulator and a Linear Layer.
  • the Length Regulator is used to adjust the length of the linguistic feature sample vector, the first hidden prosody sample vector and the voiceprint feature sample vector, so that their lengths are consistent with the voice spectrum information.
  • the Linear Layer is used to linearize the output of the Decoder.
  • The training for the L-shaped dashed box on the left side in FIG. 3B includes: converting the text sequence of the input text sample to be converted into a phoneme sequence (shown as Phoneme in the figure) and a character sequence (shown as Word in the figure), which are input into the phoneme encoding network Phoneme Encoder and the character encoding network Word Encoder, respectively. The phoneme sample vector Phoneme Embedding is then obtained through the Phoneme Encoder, and the first character sample text vector Word Embedding is obtained through the Word Encoder. Furthermore, Phoneme Embedding and Word Embedding are summed to obtain the linguistic feature sample vector H_ling.
  • In addition, the mel spectrum (mel-spec) of the sample human voice, that is, the low-frequency part of the voice sample (such as the 0-2 kHz part), is passed through the prosody encoding network Prosody Encoder to obtain the first hidden prosody sample vectors (latent prosody vectors, LPV).
  • H_ling, the voiceprint feature sample vector H_spk, and the first hidden prosody sample vector are then spliced together and sent to the subsequent decoding network to obtain the predicted mel spectrum.
  • The training process of the prosody encoding network Prosody Encoder can be exemplarily implemented as follows: feature extraction is performed on the speech sample through the first convolutional layer of the prosody encoding network based on the phoneme sample vector and the voiceprint feature sample vector, to obtain first prosodic sample features; character-level pooling is performed on the first prosodic sample features through the pooling layer of the prosody encoding network to obtain character-level prosodic sample features; feature extraction is performed on the character-level prosodic sample features through the second convolutional layer of the prosody encoding network based on the first character sample text vector and the voiceprint feature sample vector, to obtain second prosodic sample features; and the second prosodic sample features are vectorized through the vectorization layer of the prosody encoding network to obtain the first hidden prosody sample vector.
  • the prosodic encoding network structure is simplified, and the hidden prosodic sample vector can be extracted effectively.
  • The input of the prosody encoding network Prosody Encoder is the low-frequency part of the Mel spectrum of the voice sample corresponding to the text sample to be converted, together with Phoneme Embedding and Word Embedding (for simplicity of expression, indicated in the text as H_ling) and H_spk, and the output is the first hidden prosody sample vector sequence at the character level.
  • The prosody encoding network Prosody Encoder contains two levels of Conv Stacks (convolution stacks). When the first-level Conv Stacks processes the low-frequency part of the Mel spectrum, its input includes Phoneme Embedding and H_spk in addition to the low-frequency part of the Mel spectrum, so that the convolution processing of the low-frequency part of the Mel spectrum can filter out the influence of the phonemes on the prosody; the convolution-processed low-frequency part of the Mel spectrum is then compressed to the character level by the pooling operation of the character-level pooling layer (Word-level Pooling). The second-level Conv Stacks obtains the hidden prosody expression based on the output of the first-level Conv Stacks together with Word Embedding and H_spk, and the addition of Word Embedding allows the convolution processing to filter out the influence of character semantics on the prosody. Finally, based on this hidden prosody expression, the first hidden prosody sample vector sequence at the character level is obtained through the vector quantization layer (Vector Quantization).
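  • A minimal sketch of this two-level structure is given below, assuming frame-aligned inputs and a simple nearest-neighbour vector-quantization codebook; the layer sizes, the word_level_pool helper and the codebook size are illustrative assumptions rather than the actual network configuration:

```python
import torch
import torch.nn as nn

class ProsodyEncoder(nn.Module):
    """Sketch: low-band mel + phoneme/word embeddings + H_spk -> character-level LPVs."""
    def __init__(self, mel_dim=20, hid=256, codebook_size=128):
        super().__init__()
        # First Conv Stack: frame level, conditioned on Phoneme Embedding and H_spk.
        self.conv1 = nn.Sequential(nn.Conv1d(mel_dim + 2 * hid, hid, 3, padding=1), nn.ReLU())
        # Second Conv Stack: character level, conditioned on Word Embedding and H_spk.
        self.conv2 = nn.Sequential(nn.Conv1d(3 * hid, hid, 3, padding=1), nn.ReLU())
        self.codebook = nn.Embedding(codebook_size, hid)   # vector-quantization codebook

    def forward(self, low_mel, phoneme_emb, word_emb, h_spk, word_level_pool):
        # low_mel:     [B, T_frame, mel_dim]  low-frequency part of the mel spectrum
        # phoneme_emb: [B, T_frame, hid]      phoneme embedding, assumed expanded to frame level
        # word_emb:    [B, T_char, hid]       character text vector (Word Embedding)
        # h_spk:       [B, 1, hid]            voiceprint feature sample vector
        x = torch.cat([low_mel, phoneme_emb, h_spk.expand_as(phoneme_emb)], dim=-1)
        x = self.conv1(x.transpose(1, 2)).transpose(1, 2)   # filters out phoneme influence
        x = word_level_pool(x)                               # compress frames to character level
        x = torch.cat([x, word_emb, h_spk.expand_as(word_emb)], dim=-1)
        x = self.conv2(x.transpose(1, 2)).transpose(1, 2)    # filters out character semantics
        # Vector quantization: snap each character feature to its nearest codebook entry.
        codes = self.codebook.weight.unsqueeze(0).expand(x.size(0), -1, -1)
        idx = torch.cdist(x, codes).argmin(dim=-1)
        return self.codebook(idx)                             # first hidden prosody sample vectors
```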
  • Then, the prosody model can be trained based on the phoneme sample vector, the first character sample text vector, the voiceprint feature sample vector and the first hidden prosody sample vector. Specifically, this may include: adding the phoneme sample vector and the first character sample text vector through the vector splicing layer to obtain the linguistic feature sample vector; splicing the linguistic feature sample vector, the voiceprint feature sample vector and the first hidden prosody sample vector to obtain a spliced sample vector; and decoding the spliced sample vector through the decoding network and training the prosody model according to the decoding result.
  • Optionally, length regularization processing can also be performed on the spliced sample vector through the length regularization layer, and the decoding network then decodes the length-regularized spliced sample vector. Specifically, this may be as shown in (a) of FIG. 3B.
  • the prosody encoding network Prosody Encoder not only participates in the training of the left L-shaped dashed box in Figure 3B(a), but also undertakes the training task of the hidden prosody vector prediction network LPV Predictor.
  • In the subsequent data conversion (inference) stage, the prosody prediction will be mainly realized by the LPV Predictor, and the prosody encoding network Prosody Encoder will no longer function.
  • Therefore, the training of the prosody model also includes: inputting the second character sample text vector and the voiceprint feature sample vector into the hidden prosody vector prediction network to predict the second hidden prosody sample vector; and training the hidden prosody vector prediction network according to the difference between the first hidden prosody sample vector and the second hidden prosody sample vector.
  • the acquisition of the second character sample text vector can be realized by using the context encoding network Context Encoder in the upper right corner of Figure 3B, and its specific structure can adopt the BERT model structure.
  • However, other structures, such as any model structure trained on plain text, are also applicable to the solutions of the embodiments of the present application.
  • A simple schematic diagram of training the hidden prosody vector prediction network is shown in the lower right corner of FIG. 3B. It can be seen that the prosody encoding network Prosody Encoder outputs the first hidden prosody sample vector based on the low-frequency part of the Mel spectrum of the speech sample (that is, the noisy audio shown in the lower right corner of FIG. 3B), Phoneme Embedding and Word Embedding (for simplicity indicated in the text as H_ling) and H_spk, while the LPV Predictor outputs the second hidden prosody sample vector based on the character sequence of the text to be converted (shown as Word in the figure) and H_spk. In (d) of FIG. 3B, the two hidden prosody sample vectors are both shown as LPV. Based on the difference between the two, the LPV Predictor can be trained.
  • the loss function may be any appropriate function, including but not limited to a distance function such as a cosine distance function, which is not limited in this embodiment of the present application.
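  • As one possible choice among the appropriate distance functions mentioned above, a cosine-distance loss between the first (encoder-produced) and second (predicted) hidden prosody sample vectors could be written as follows; this is purely illustrative:

```python
import torch
import torch.nn.functional as F

def lpv_loss(lpv_first: torch.Tensor, lpv_second: torch.Tensor) -> torch.Tensor:
    # lpv_first:  character-level LPVs from the Prosody Encoder (training target)
    # lpv_second: character-level LPVs from the LPV Predictor (prediction)
    return (1.0 - F.cosine_similarity(lpv_first, lpv_second, dim=-1)).mean()
```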
  • The LPV Predictor is an autoregressive prediction model, as can be seen from (c) in FIG. 3B. On the one hand, the LPV Predictor converts the input Word into a character vector through the Context Encoder, and the character vector output by the Context Encoder is expressed as H_i; on the other hand, when the LPV Predictor processes the current character, it also uses the LPV corresponding to the previous character (LPV i-1 in the figure) as a reference. The two are spliced, and the spliced vector is subjected to subsequent processing (such as normalization and convolution), where the normalization layer can be expressed, for example, as add&norm (add denotes residual processing and norm denotes normalization processing), and the convolutional layer can be, for example, Conv1D (one-dimensional convolution); finally, the prosody prediction result for the current character, that is, LPV i, is obtained.
  • That is, the prediction process can be implemented as: inputting the second character sample text vector corresponding to the current character to be predicted and the voiceprint feature sample vector into the hidden prosody vector prediction network; performing feature fusion on them with the second hidden prosody sample vector corresponding to the previous character of the current character; and, based on the fused feature vector, predicting the second hidden prosody sample vector of the current character. More accurate prosodic information can be obtained through this autoregressive method.
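  • The autoregressive prediction loop described above might be sketched as follows. The fusion rule, the per-character stand-ins for the add&norm and Conv1D sub-layers, and the zero initial LPV are assumptions made only to keep the example short:

```python
import torch
import torch.nn as nn

class LPVPredictor(nn.Module):
    """Sketch of the autoregressive hidden-prosody-vector predictor."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.fuse = nn.Linear(3 * dim, dim)    # fuses H_i, H_spk and the previous LPV
        self.norm = nn.LayerNorm(dim)          # stands in for the add&norm sub-layer
        self.conv = nn.Conv1d(dim, dim, 1)     # stands in for the Conv1D sub-layer
        self.out = nn.Linear(dim, dim)

    def forward(self, context_emb: torch.Tensor, h_spk: torch.Tensor) -> torch.Tensor:
        # context_emb: [B, T_char, dim] character vectors H_i from the Context Encoder
        # h_spk:       [B, dim]         voiceprint feature vector
        B, T, D = context_emb.shape
        lpv_prev = torch.zeros(B, D, device=context_emb.device)   # assumed initial LPV
        lpvs = []
        for i in range(T):
            h_i = context_emb[:, i]
            fused = self.fuse(torch.cat([h_i, h_spk, lpv_prev], dim=-1))
            fused = self.norm(fused + h_i)                         # residual + normalization
            fused = self.conv(fused.unsqueeze(-1)).squeeze(-1)     # per-character convolution
            lpv_i = self.out(fused)                                # prediction for the i-th character
            lpvs.append(lpv_i)
            lpv_prev = lpv_i
        return torch.stack(lpvs, dim=1)                            # [B, T_char, dim]
```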
  • Step S304 Obtain the phoneme vector, text vector and voiceprint feature vector of the target human voice corresponding to the text to be converted.
  • In this embodiment, the phoneme sequence of the text to be converted is encoded into the phoneme vector Phoneme Embedding through the phoneme encoding network Phoneme Encoder of the prosody model, and the character sequence of the text to be converted is encoded into the character text vector Word Embedding through its character encoding network Word Encoder.
  • the voiceprint feature vector H_spk of the target voice can be obtained in advance, and the specific means of extracting the voiceprint feature vector based on the target voice is not limited in this embodiment of the present application.
  • Step S306 According to the phoneme vector and the text vector, obtain the linguistic feature vector corresponding to the text to be converted; according to the text vector and the voiceprint feature vector, predict and obtain the hidden prosodic vector of the text to be converted.
  • Step S308 Generate speech spectrum information corresponding to the text to be converted according to the linguistic feature vector, hidden prosody vector and voiceprint feature vector.
  • the corresponding speech can be output through the vocoder, realizing the conversion from text to speech.
  • In this embodiment, the hidden prosody vector is used to represent the prosody instead of individual prosody components, which avoids the problems in traditional methods of poor prosody modeling caused by inaccurate fundamental frequency extraction and by the lack of correlation in the prediction of each prosody component, problems that lead to a poor spectrum and, in turn, poor speech synthesis.
  • On the one hand, the prosody modeling is no longer based on the fundamental frequency; instead, the prosody information is extracted based on various kinds of information related to the prosody, which makes the extracted prosody more accurate. On the other hand, the comprehensive consideration of the relationship between the various factors affecting the prosody (such as the phonemes, the text, and the voiceprint of the target human voice) also makes the prosody thus obtained more accurate.
  • FIG. 4 shows a schematic structural diagram of an electronic device according to Embodiment 3 of the present application.
  • the specific embodiment of the present application does not limit the specific implementation of the electronic device.
  • the electronic device may include: a processor (processor) 402, a communication interface (Communications Interface) 404, a memory (memory) 406, and a communication bus 408.
  • the processor 402 , the communication interface 404 , and the memory 406 communicate with each other through the communication bus 408 .
  • the communication interface 404 is used for communicating with other electronic devices or servers.
  • the processor 402 is configured to execute the program 410, and specifically, may execute relevant steps in the foregoing data conversion method embodiments.
  • the program 410 may include program codes including computer operation instructions.
  • the processor 402 may be a CPU, or an ASIC (Application Specific Integrated Circuit), or one or more integrated circuits configured to implement the embodiments of the present application.
  • the one or more processors included in the smart device may be of the same type, such as one or more CPUs, or may be different types of processors, such as one or more CPUs and one or more ASICs.
  • the memory 406 is used to store the program 410 .
  • the memory 406 may include a high-speed RAM memory, and may also include a non-volatile memory, such as at least one disk memory.
  • the program 410 may be specifically configured to enable the processor 402 to perform any operation described in any one of the foregoing data conversion method embodiments.
  • each step in the program 410 refers to the corresponding descriptions in the corresponding steps and units in the above-mentioned data conversion method embodiment and related method embodiments, and have corresponding beneficial effects, so details are not repeated here.
  • the specific working process of the above-described devices and modules can refer to the corresponding process description in the foregoing method embodiments, and details are not repeated here.
  • the embodiment of the present application also provides a data conversion device, including:
  • an obtaining module, used to obtain the phoneme vector and text vector corresponding to the text to be converted, and the voiceprint feature vector of the target human voice;
  • the obtaining module is also used to obtain a linguistic feature vector corresponding to the text to be converted according to the phoneme vector and the text vector;
  • a prediction module configured to predict and obtain the hidden prosody vector of the text to be converted according to the text vector and the voiceprint feature vector;
  • a generating module configured to generate speech spectrum information corresponding to the text to be converted according to the linguistic feature vector, the hidden prosodic vector, and the voiceprint feature vector.
  • the text vector is a character text vector corresponding to each character in the text to be converted.
  • the data conversion method is performed by a prosody model
  • the prosody model at least includes: a phoneme encoding network, a text encoding network, a hidden prosody vector prediction network, a vector splicing layer, and a decoding network;
  • the phoneme encoding network is used to obtain the phoneme vector corresponding to the text to be converted;
  • the text encoding network is used to obtain a text vector corresponding to the text to be converted
  • the hidden prosody vector prediction network is used to predict and obtain the hidden prosody vector of the text to be converted according to the text vector corresponding to the text to be converted and the acquired voiceprint feature vector of the target human voice;
  • the vector splicing layer is used to add the phoneme vector and the text vector to obtain the linguistic feature vector corresponding to the text to be converted; and, for the linguistic feature vector and the hidden prosody vector Splicing with the voiceprint feature vector to generate a splicing vector;
  • the decoding network is configured to decode the splicing vector to obtain speech spectrum information corresponding to the text to be converted.
  • the text encoding network includes a character encoding network and a context encoding network
  • the character encoding network is used to perform character-level encoding on the text to be converted, and generate a character text vector for adding to the phoneme vector;
  • the context encoding network is used to perform character-level encoding on the text to be converted, and generate a character text vector for inputting the hidden prosody vector prediction network together with the voiceprint feature vector.
  • Optionally, the acquiring module is also used for acquiring a training sample, wherein the training sample includes a text sample to be converted, a corresponding voice sample, and a voiceprint feature sample vector, and the voice sample is a voice sample with a frequency band of 0-2 kHz;
  • the device also includes a processing module
  • the processing module is used to train the prosody model by using the training samples.
  • the prosodic model further includes a prosodic coding network
  • the processing module is specifically used for:
  • the prosody model is trained based on the phoneme sample vector, the character sample text vector, the voiceprint feature sample vector and the first hidden prosodic sample vector.
  • the processing module is specifically used for:
  • the characters of the text to be converted are respectively input into the character encoding network and the context encoding network to obtain corresponding first character sample text vectors and second character sample text vectors;
  • the inputting of the speech sample, the phoneme sample vector, the character sample text vector and the voiceprint feature sample vector into the prosody encoding network to obtain the corresponding first hidden prosody sample vector includes: inputting the speech sample, the phoneme sample vector, the first character sample text vector and the voiceprint feature sample vector into the prosody encoding network to obtain the corresponding first hidden prosody sample vector.
  • the processing module is specifically used for:
  • the second prosodic sample feature is vectorized by the vectorization layer of the prosodic encoding network to obtain a first hidden prosodic sample vector.
  • the processing module is specifically used for:
  • the hidden prosodic vector prediction network is trained according to the difference between the first hidden prosodic sample vector and the second hidden prosodic sample vector.
  • the embodiment of the present application also provides a data conversion device, including:
  • An acquisition module configured to acquire a response to a user instruction sent to the smart device, where the response contains text to be replied to the user instruction;
  • the obtaining module is also used to obtain the phoneme vector, text vector and voiceprint feature vector of the target human voice corresponding to the text to be replied;
  • the obtaining module is also used to obtain a linguistic feature vector corresponding to the text to be replied according to the phoneme vector and the text vector;
  • a prediction module configured to predict and obtain the hidden prosody vector of the text to be replied according to the text vector and the voiceprint feature vector;
  • a generation module configured to generate speech spectrum information corresponding to the text to be replied according to the linguistic feature vector, the hidden prosody vector and the voiceprint feature vector;
  • a processing module configured to generate and play the voice corresponding to the text to be replied according to the voice spectrum information.
  • the embodiment of the present application also provides a data conversion device, including:
  • An acquisition module configured to acquire the live script text corresponding to the object to be broadcasted
  • the acquiring module is also used to acquire the phoneme vector, text vector and voiceprint feature vector of the target human voice corresponding to the live script text;
  • the obtaining module is also used to obtain a linguistic feature vector corresponding to the live script text according to the phoneme vector and the text vector;
  • a prediction module configured to predict and obtain the hidden prosody vector of the live script text according to the text vector and the voiceprint feature vector;
  • a generating module configured to generate voice spectrum information corresponding to the live script text according to the linguistic feature vector, the hidden prosody vector and the voiceprint feature vector;
  • a processing module configured to generate live speech corresponding to the live script text according to the speech spectrum information.
  • the embodiment of the present application also provides a data conversion device, including:
  • the obtaining module is used to obtain the script text to be played, wherein the script text to be played includes one of the following: a line script corresponding to audio or video, or the text content of an e-book;
  • the acquisition module is also used to acquire the phoneme vector, text vector and voiceprint feature vector of the target human voice corresponding to the script text;
  • the obtaining module is also used to obtain a linguistic feature vector corresponding to the script text according to the phoneme vector and the text vector;
  • a prediction module configured to predict and obtain the hidden prosody vector of the script text according to the text vector and the voiceprint feature vector
  • a generating module configured to generate speech spectrum information corresponding to the script text according to the linguistic feature vector, the hidden prosody vector and the voiceprint feature vector;
  • a processing module configured to generate a performance voice corresponding to the script text according to the voice spectrum information.
  • the embodiment of the present application also provides a data conversion device, including:
  • An acquisition module configured to acquire a phoneme vector corresponding to the text to be converted through the phoneme encoding network of the prosodic model; and acquire a text vector corresponding to the text to be converted through the text encoding network of the prosodic model;
  • the prediction module is used to predict and obtain the hidden prosodic vector of the text to be converted through the hidden prosodic vector prediction network of the prosody model according to the text vector corresponding to the text to be converted and the acquired voiceprint feature vector of the target human voice;
  • the acquisition module is also used to add the phoneme vector and the text vector through the vector splicing layer of the prosody model to obtain the linguistic feature vector corresponding to the text to be converted;
  • a generating module configured to splice the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector to generate a spliced vector;
  • a processing module configured to decode the spliced vector through the decoding network of the prosody model to obtain speech spectrum information corresponding to the text to be converted.
  • An embodiment of the present application further provides a computer program product, including computer instructions, where the computer instructions instruct a computing device to perform operations corresponding to any one of the data conversion methods in the foregoing method embodiments.
  • the input of the prosodic encoding network in the embodiments of the present application is exemplified by the Mel spectrum, but is not limited thereto; other acoustic features (such as LPC features, MFCC, fbank, raw waveform, etc.) are also applicable.
  • each component/step described in the embodiments of the present application can be divided into more components/steps, and two or more components/steps or partial operations of components/steps can also be combined into new components/steps to achieve the purpose of the embodiments of the present application.
  • the above methods according to the embodiments of the present application can be implemented in hardware or firmware, or as software or computer code that can be stored in a recording medium (such as a CD-ROM, RAM, floppy disk, hard disk, or magneto-optical disk), or as computer code originally stored on a remote recording medium or a non-transitory machine-readable medium and downloaded over a network to be stored on a local recording medium, so that the methods described herein can be processed by such software stored on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware such as an ASIC or FPGA.
  • a computer, processor, microprocessor controller, or programmable hardware includes memory components (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code which, when accessed and executed by the computer, processor, or hardware, implements the data conversion methods described herein.
  • the execution of the code converts the general-purpose computer into a special-purpose computer for executing the data conversion methods described herein.

Abstract

A data conversion method and a corresponding apparatus, an electronic device, a computer storage medium, and a computer program product. The data conversion method comprises: obtaining a phoneme vector and a text vector corresponding to a text to be converted and a voiceprint feature vector of a target voice; obtaining, according to the phoneme vector and the text vector, a linguistic feature vector corresponding to said text; predicting, according to the text vector and the voiceprint feature vector, a hidden prosody vector of said text; and generating, according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector, speech spectrum information corresponding to said text. According to the method, the prosody determined for the text to be converted into speech is more accurate.

Description

Data conversion method and computer storage medium
This application claims priority to the Chinese patent application No. 202111559250.5, entitled "Data conversion method and computer storage medium", filed with the China Patent Office on December 20, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The embodiments of the present application relate to the field of computer technologies, and in particular, to a data conversion method and a computer storage medium.
Background
Speech synthesis technology, also known as text-to-speech (TTS) technology, can convert text information into standard and fluent speech, which is equivalent to installing an artificial mouth on a machine. To achieve an effect closer to a real human voice, highly expressive speech synthesis is required. Such speech synthesis needs to model prosody, and the prosody model is used to improve the expressiveness of the synthesized speech.
Generally speaking, prosodic components include fundamental frequency, energy, and duration. Existing prosody modeling is usually built on the fundamental frequency features of the prosody. On the one hand, inaccurate fundamental frequency extraction leads to poor prosody modeling, which in turn makes the prosody information obtained from it inaccurate; on the other hand, failing to consider the correlations among the factors that affect prosody also results in poor prosody modeling and inaccurate prosody information.
Therefore, the prosody information obtained with current prosody modeling approaches suffers from poor accuracy.
Summary of the Invention
In view of this, embodiments of the present application provide a data conversion solution to at least partially solve the above problem.
According to a first aspect of the embodiments of the present application, a data conversion method is provided, including:
obtaining a phoneme vector and a text vector corresponding to a text to be converted and a voiceprint feature vector of a target human voice; obtaining a linguistic feature vector corresponding to the text to be converted according to the phoneme vector and the text vector; predicting a hidden prosody vector of the text to be converted according to the text vector and the voiceprint feature vector; and generating speech spectrum information corresponding to the text to be converted according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector.
According to a second aspect of the embodiments of the present application, a data conversion method is provided, including:
obtaining a response to a user instruction sent to a smart device, where the response contains a text to be replied for the user instruction;
obtaining a phoneme vector and a text vector corresponding to the text to be replied and a voiceprint feature vector of a target human voice;
obtaining a linguistic feature vector corresponding to the text to be replied according to the phoneme vector and the text vector; and predicting a hidden prosody vector of the text to be replied according to the text vector and the voiceprint feature vector;
generating speech spectrum information corresponding to the text to be replied according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector;
generating the speech corresponding to the text to be replied according to the speech spectrum information and playing the speech.
According to a third aspect of the embodiments of the present application, a data conversion method is provided, including:
obtaining the live script text corresponding to an object to be broadcast live;
obtaining a phoneme vector and a text vector corresponding to the live script text and a voiceprint feature vector of a target human voice;
obtaining a linguistic feature vector corresponding to the live script text according to the phoneme vector and the text vector; and predicting a hidden prosody vector of the live script text according to the text vector and the voiceprint feature vector;
generating speech spectrum information corresponding to the live script text according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector;
generating live speech corresponding to the live script text according to the speech spectrum information.
According to a fourth aspect of the embodiments of the present application, a data conversion method is provided, including:
obtaining a script text to be performed, where the script text to be performed includes one of the following: a line script corresponding to audio or video, and the text content of an e-book;
obtaining a phoneme vector and a text vector corresponding to the script text and a voiceprint feature vector of a target human voice;
obtaining a linguistic feature vector corresponding to the script text according to the phoneme vector and the text vector; and predicting a hidden prosody vector of the script text according to the text vector and the voiceprint feature vector;
generating speech spectrum information corresponding to the script text according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector;
generating performance speech corresponding to the script text according to the speech spectrum information.
According to a fifth aspect of the embodiments of the present application, a data conversion method is provided, including:
obtaining a phoneme vector corresponding to a text to be converted through a phoneme encoding network of a prosody model, and obtaining a text vector corresponding to the text to be converted through a text encoding network of the prosody model;
predicting a hidden prosody vector of the text to be converted through a hidden prosody vector prediction network of the prosody model according to the text vector corresponding to the text to be converted and an acquired voiceprint feature vector of a target human voice;
adding the phoneme vector and the text vector through a vector splicing layer of the prosody model to obtain a linguistic feature vector corresponding to the text to be converted; and splicing the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector to generate a spliced vector;
decoding the spliced vector through a decoding network of the prosody model to obtain speech spectrum information corresponding to the text to be converted.
According to a sixth aspect of the embodiments of the present application, a data conversion device is provided, including:
an acquisition module, configured to acquire a phoneme vector and a text vector corresponding to a text to be converted and a voiceprint feature vector of a target human voice;
the acquisition module is further configured to obtain a linguistic feature vector corresponding to the text to be converted according to the phoneme vector and the text vector;
a prediction module, configured to predict a hidden prosody vector of the text to be converted according to the text vector and the voiceprint feature vector;
a generation module, configured to generate speech spectrum information corresponding to the text to be converted according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector.
According to a seventh aspect of the embodiments of the present application, a data conversion device is provided, including:
an acquisition module, configured to acquire a response to a user instruction sent to a smart device, where the response contains a text to be replied for the user instruction;
the acquisition module is further configured to acquire a phoneme vector and a text vector corresponding to the text to be replied and a voiceprint feature vector of a target human voice;
the acquisition module is further configured to obtain a linguistic feature vector corresponding to the text to be replied according to the phoneme vector and the text vector;
a prediction module, configured to predict a hidden prosody vector of the text to be replied according to the text vector and the voiceprint feature vector;
a generation module, configured to generate speech spectrum information corresponding to the text to be replied according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector;
a processing module, configured to generate the speech corresponding to the text to be replied according to the speech spectrum information and play the speech.
According to an eighth aspect of the embodiments of the present application, a data conversion device is provided, including:
an acquisition module, configured to acquire the live script text corresponding to an object to be broadcast live;
the acquisition module is further configured to acquire a phoneme vector and a text vector corresponding to the live script text and a voiceprint feature vector of a target human voice;
the acquisition module is further configured to obtain a linguistic feature vector corresponding to the live script text according to the phoneme vector and the text vector;
a prediction module, configured to predict a hidden prosody vector of the live script text according to the text vector and the voiceprint feature vector;
a generation module, configured to generate speech spectrum information corresponding to the live script text according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector;
a processing module, configured to generate live speech corresponding to the live script text according to the speech spectrum information.
According to a ninth aspect of the embodiments of the present application, a data conversion device is provided, including:
an acquisition module, configured to acquire a script text to be performed, where the script text to be performed includes one of the following: a line script corresponding to audio or video, and the text content of an e-book;
the acquisition module is further configured to acquire a phoneme vector and a text vector corresponding to the script text and a voiceprint feature vector of a target human voice;
the acquisition module is further configured to obtain a linguistic feature vector corresponding to the script text according to the phoneme vector and the text vector;
a prediction module, configured to predict a hidden prosody vector of the script text according to the text vector and the voiceprint feature vector;
a generation module, configured to generate speech spectrum information corresponding to the script text according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector;
a processing module, configured to generate performance speech corresponding to the script text according to the speech spectrum information.
According to a tenth aspect of the embodiments of the present application, a data conversion device is provided, including:
an acquisition module, configured to acquire a phoneme vector corresponding to a text to be converted through a phoneme encoding network of a prosody model, and acquire a text vector corresponding to the text to be converted through a text encoding network of the prosody model;
a prediction module, configured to predict a hidden prosody vector of the text to be converted through a hidden prosody vector prediction network of the prosody model according to the text vector corresponding to the text to be converted and an acquired voiceprint feature vector of a target human voice;
the acquisition module is further configured to add the phoneme vector and the text vector through a vector splicing layer of the prosody model to obtain a linguistic feature vector corresponding to the text to be converted;
a generation module, configured to splice the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector to generate a spliced vector;
a processing module, configured to decode the spliced vector through a decoding network of the prosody model to obtain speech spectrum information corresponding to the text to be converted.
According to an eleventh aspect of the embodiments of the present application, an electronic device is provided, including:
a memory, configured to store a program;
a processor, configured to execute the program stored in the memory, where when the program is executed, the processor is configured to perform the data conversion method according to any one of the first to fifth aspects.
According to a twelfth aspect of the embodiments of the present application, a computer storage medium is provided, on which a computer program is stored, and when the program is executed by a processor, the data conversion method according to any one of the first to fifth aspects is implemented.
According to a thirteenth aspect of the embodiments of the present application, a computer program product is provided, including a computer program, where when the computer program is executed by a processor, the data conversion method according to any one of the first to fifth aspects is implemented.
According to the data conversion solution provided by the embodiments of the present application, when acquiring the spectrum of the text to be converted into speech, the phonemes and text of the text to be converted and the voiceprint features of the target human voice are considered together. The linguistic features of the text to be converted can be obtained based on the phonemes and the text; these features carry pronunciation features at the level corresponding to the text (such as the character level, word level, or sentence level). The hidden prosody vector of the text to be converted can be predicted based on the text and the voiceprint features; this vector mainly contains prosody information, and the prosody obtained in this way is derived from features corresponding to the text and focuses more on the characteristics of the prosody itself. The speech spectrum information finally obtained by processing the linguistic features, the hidden prosody vector, and the voiceprint features better fits the speech characteristics of the target human voice corresponding to the actual voiceprint features and is closer to the actual prosody of the target human voice. As a result, the speech subsequently generated based on the obtained speech spectrum information is also closer to an actual human voice.
It can be seen that, with the solution of the embodiments of the present application, on the one hand, prosody modeling is no longer based on the fundamental frequency; instead, prosody information is extracted from multiple kinds of information related to prosody, which makes the extracted prosody more accurate. On the other hand, the relationships among the various factors that affect prosody (such as the phonemes, the text, and the voiceprint of the target human voice) are considered together, which also makes the resulting prosody more accurate.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of an exemplary system to which the data conversion method of the embodiments of the present application is applicable;
FIG. 2A is a flowchart of the steps of a data conversion method according to Embodiment 1 of the present application;
FIG. 2B is a schematic diagram of an example model in the embodiment shown in FIG. 2A;
FIG. 2C is a schematic diagram of an example scenario in the embodiment shown in FIG. 2A;
FIG. 3A is a flowchart of the steps of a data conversion method according to Embodiment 2 of the present application;
FIG. 3B is a schematic diagram of an example model and its training process in the embodiment shown in FIG. 3A;
FIG. 4 is a schematic structural diagram of an electronic device according to Embodiment 3 of the present application.
Detailed Description
To enable those skilled in the art to better understand the technical solutions in the embodiments of the present application, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings of the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, rather than all of them. All other embodiments obtained by persons of ordinary skill in the art based on the embodiments of the present application shall fall within the protection scope of the embodiments of the present application.
Specific implementations of the embodiments of the present application are further described below with reference to the accompanying drawings.
FIG. 1 shows an exemplary system to which the data conversion method of the embodiments of the present application is applicable. As shown in FIG. 1, the system 100 may include a server 102, a communication network 104, and/or one or more user devices 106, illustrated in FIG. 1 as multiple user devices.
The server 102 may be any suitable server for storing information, data, programs, and/or any other suitable type of content. In some embodiments, the server 102 may perform any suitable function. For example, in some embodiments, the server 102 may be used to determine the speech spectrum information to be used in the speech synthesis process. As an optional example, in some embodiments, the server 102 may be used to determine, based on the text to be converted, the corresponding speech spectrum information, and then perform speech synthesis based on the speech spectrum information. As another example, in some embodiments, the server 102 may determine the corresponding speech spectrum information based on the phonemes and text corresponding to the text to be converted and the voiceprint of the target human voice.
In some embodiments, the communication network 104 may be any suitable combination of one or more wired and/or wireless networks. For example, the communication network 104 can include any one or more of the following: the Internet, an intranet, a wide area network (WAN), a local area network (LAN), a wireless network, a digital subscriber line (DSL) network, a frame relay network, an asynchronous transfer mode (ATM) network, a virtual private network (VPN), and/or any other suitable communication network. The user device 106 can be connected to the communication network 104 via one or more communication links (e.g., communication link 112), and the communication network 104 can be linked to the server 102 via one or more communication links (e.g., communication link 114). A communication link may be any communication link suitable for transferring data between the user device 106 and the server 102, such as a network link, a dial-up link, a wireless link, a hard-wired link, any other suitable communication link, or any suitable combination of such links.
The user device 106 may include any one or more user devices suitable for presenting an interface for information input and output and for playing speech. In some embodiments, the user device 106 may include any suitable type of device. For example, in some embodiments, the user device 106 may include an IoT device, a mobile device, a tablet computer, a laptop computer, a desktop computer, a wearable computer, a game console, a media player, a vehicle entertainment system, and/or any other suitable type of user device. Note that, in some embodiments, if the user device 106 has sufficiently high software and hardware performance, it may also implement the functions of the server 102 instead.
Although the server 102 is illustrated as one device, in some embodiments any suitable number of devices may be used to perform the functions performed by the server 102. For example, in some embodiments, multiple devices may be used to implement the functions performed by the server 102. Alternatively, the functions of the server 102 may be implemented using a cloud service.
Based on the above system, an embodiment of the present application provides a data conversion method, which is described below through multiple embodiments.
Embodiment 1
Referring to FIG. 2A, a flowchart of the steps of a data conversion method according to Embodiment 1 of the present application is shown.
The data conversion method of this embodiment includes the following steps:
Step S202: Obtain a phoneme vector and a text vector corresponding to the text to be converted and a voiceprint feature vector of the target human voice.
A phoneme is the smallest speech unit divided according to the natural attributes of speech; analyzed in terms of the articulatory actions within a syllable, one action constitutes one phoneme. For example, 阿 (a) has only one phoneme, 个 (ge) has two phonemes, and so on. Generally speaking, in the process of converting text into speech, phonemes are an important consideration and basis for the conversion. In the specific conversion process, it is also necessary to determine what kind of human voice the text is to be converted into; therefore, voiceprint features are needed as a reference so that speech approximating the target human voice is finally generated.
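As an illustration only, the grapheme-to-phoneme step for Chinese text might be sketched as follows. The use of the pypinyin library, and treating tone-numbered pinyin syllables as the phoneme sequence, are assumptions made for this sketch and are not part of the embodiments themselves.

```python
# Minimal grapheme-to-phoneme sketch (assumption: pypinyin; tone-numbered pinyin stands in for phonemes).
from pypinyin import lazy_pinyin, Style

def text_to_phonemes(text: str):
    """Convert each Chinese character into a tone-numbered pinyin syllable."""
    return lazy_pinyin(text, style=Style.TONE3)

print(text_to_phonemes("一加一等于二"))  # e.g. ['yi1', 'jia1', 'yi1', 'deng3', 'yu2', 'er4']
```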
In addition, the text vector of the text to be converted is also used in the embodiments of the present application. In practical applications, the text vector may be at different levels, such as the phoneme level, character level, word level, clause level, or sentence level. The text vector is strongly correlated with the other vectors used to generate prosody, such as the phoneme vector and the voiceprint feature vector, and it can provide richer reference information, including but not limited to textual information and/or semantic information, for the subsequent generation of prosody-related vectors. Preferably, the text vector may be at the character level: on the one hand, its correspondence with the phoneme vector is better; on the other hand, it can be realized with a relatively simple network structure, which reduces the implementation complexity and cost of the solution.
It should be noted that the specific manner of generating the corresponding phoneme vector and text vector based on the text to be converted in this step, as well as the manner of obtaining the voiceprint feature vector of the target human voice, can be implemented by those skilled in the art in an appropriate way (such as with a neural network model or an algorithm) according to the actual situation, which is not limited in the embodiments of the present application.
Step S204: Obtain a linguistic feature vector corresponding to the text to be converted according to the phoneme vector and the text vector; and predict a hidden prosody vector of the text to be converted according to the text vector and the voiceprint feature vector.
In the embodiments of the present application, on the one hand, the text vector is combined with the phoneme vector to generate a linguistic feature vector carrying prosodic information and semantic information; on the other hand, the text vector is combined with the voiceprint feature vector to predict a hidden prosody vector that mainly carries prosody information related to the text.
Although text vectors are used in both aspects, it can be seen from the above that the goals to be achieved with the text vectors are different. Therefore, in one feasible manner, the text vectors used in the two aspects can be obtained in different ways. For example, the text vector to be combined with the phoneme vector can be obtained through a character encoding network (also called a character encoder), while the text vector to be combined with the voiceprint feature vector can be obtained through a context encoding network (also called a context encoder, such as a BERT model). In this way, the needs of the different parts can be better met, and the overall solution is more flexible. The full name of the BERT model is Bidirectional Encoder Representations from Transformers.
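For illustration, character-level text vectors from a pretrained context encoder might be obtained as in the following sketch. The Hugging Face transformers library and the bert-base-chinese checkpoint are assumptions of the example, since the embodiments only require some context encoding network such as a BERT model.

```python
# Sketch: character-level text vectors from a pretrained BERT context encoder.
# Assumptions: Hugging Face transformers and the "bert-base-chinese" checkpoint.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")

text = "一加一等于二"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

# last_hidden_state: [1, seq_len, 768]; drop [CLS]/[SEP] to keep one vector per character.
char_text_vectors = outputs.last_hidden_state[:, 1:-1, :]
print(char_text_vectors.shape)  # torch.Size([1, 6, 768])
```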
Step S206: Generate speech spectrum information corresponding to the text to be converted according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector.
After the linguistic feature vector and the hidden prosody vector are obtained, they are combined with the previously obtained voiceprint feature vector for feature fusion, and corresponding processing such as decoding is performed based on the fused features to obtain the speech spectrum information, which contains the prosody information of the text to be converted. In the embodiments of the present application, the prosody information includes but is not limited to intonation, speech rate, energy, and spatial information.
In one feasible manner, the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector may be spliced to generate a spliced vector, and the spliced vector may be decoded to obtain the speech spectrum information corresponding to the text to be converted. Because the spliced vector carries rich information associated with the prosody of the text to be converted, the speech spectrum information obtained by decoding the spliced vector is also more accurate.
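A minimal sketch of this splicing step is given below; the tensor shapes and the use of PyTorch are assumptions for illustration.

```python
# Sketch: splicing the linguistic, hidden prosody and voiceprint vectors before decoding.
# Assumed shapes: [batch, length, dim] for sequence features, [batch, dim] for the voiceprint vector.
import torch

B, T, D = 2, 50, 256
h_ling = torch.randn(B, T, D)   # linguistic feature vector
lpv = torch.randn(B, T, D)      # hidden prosody vector (assumed aligned to the same length)
h_spk = torch.randn(B, D)       # voiceprint feature vector of the target human voice

spliced = torch.cat([h_ling, lpv, h_spk.unsqueeze(1).expand(-1, T, -1)], dim=-1)
print(spliced.shape)  # torch.Size([2, 50, 768]); this spliced vector is then fed to the decoding network
```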
In one feasible manner, the above process can be implemented by a neural network model, referred to in this application as a prosody model. An exemplary prosody model is shown in FIG. 2B. As can be seen from FIG. 2B, the prosody model includes: a phoneme encoding network (shown in the figure as the Phoneme Encoder), a text encoding network (shown in the figure as the character-level Word Encoder), a hidden prosody vector prediction network (shown in the figure as the LPV Predictor), a vector splicing layer (shown in the figure as the dashed box containing the "+" sign), and a decoding network (shown in the figure as the dashed box containing the Decoder).
The phoneme encoding network is used to obtain the phoneme vector corresponding to the text to be converted; the text encoding network is used to obtain the text vector corresponding to the text to be converted; the hidden prosody vector prediction network is used to predict the hidden prosody vector of the text to be converted according to the text vector corresponding to the text to be converted and the acquired voiceprint feature vector of the target human voice; the vector splicing layer is used to add the phoneme vector and the text vector to obtain the linguistic feature vector corresponding to the text to be converted, and to splice the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector to generate a spliced vector; the decoding network is used to decode the spliced vector to obtain the speech spectrum information corresponding to the text to be converted.
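A schematic PyTorch skeleton of such a prosody model is sketched below. The layer types, dimensions, and the assumption that the phoneme and character sequences are already aligned to the same length are illustrative simplifications; they do not reflect the exact network structures of the embodiments.

```python
# Schematic skeleton of the prosody model (illustrative simplifications throughout).
import torch
import torch.nn as nn

class ProsodyModel(nn.Module):
    def __init__(self, n_phonemes=100, n_chars=5000, d=256, n_mels=80):
        super().__init__()
        self.phoneme_encoder = nn.Embedding(n_phonemes, d)        # Phoneme Encoder
        self.word_encoder = nn.Embedding(n_chars, d)              # character-level Word Encoder
        self.lpv_predictor = nn.GRU(2 * d, d, batch_first=True)   # LPV Predictor (text vector + voiceprint)
        self.decoder = nn.GRU(3 * d, d, batch_first=True)         # decoding network
        self.linear = nn.Linear(d, n_mels)                        # Linear Layer -> Mel spectrogram

    def forward(self, phonemes, chars, h_spk):
        # phonemes, chars: [B, T] index sequences (assumed aligned); h_spk: [B, d] voiceprint vector
        h_ling = self.phoneme_encoder(phonemes) + self.word_encoder(chars)   # addition -> linguistic features
        spk = h_spk.unsqueeze(1).expand(-1, phonemes.size(1), -1)
        lpv, _ = self.lpv_predictor(torch.cat([self.word_encoder(chars), spk], dim=-1))
        spliced = torch.cat([h_ling, lpv, spk], dim=-1)                       # vector splicing layer
        out, _ = self.decoder(spliced)
        return self.linear(out)                                               # predicted speech spectrum
```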
When the prosody model shown in FIG. 2B is used, the solution of the embodiments of the present application can be implemented as follows: obtaining the phoneme vector corresponding to the text to be converted through the phoneme encoding network of the prosody model, and obtaining the text vector corresponding to the text to be converted through the text encoding network of the prosody model; predicting the hidden prosody vector of the text to be converted through the hidden prosody vector prediction network of the prosody model according to the text vector corresponding to the text to be converted and the acquired voiceprint feature vector of the target human voice; adding the phoneme vector and the text vector through the vector splicing layer of the prosody model to obtain the linguistic feature vector corresponding to the text to be converted, and splicing the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector to generate a spliced vector; and decoding the spliced vector through the decoding network of the prosody model to obtain the speech spectrum information corresponding to the text to be converted.
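Using the ProsodyModel skeleton sketched above, an inference call might look like the following; the dummy shapes are purely illustrative.

```python
# Dummy inference call on the ProsodyModel sketched above (shapes are illustrative).
import torch

model = ProsodyModel()                       # class from the skeleton above
phonemes = torch.randint(0, 100, (1, 20))    # phoneme index sequence of the text to be converted
chars = torch.randint(0, 5000, (1, 20))      # character index sequence (assumed aligned to the phonemes)
h_spk = torch.randn(1, 256)                  # voiceprint feature vector of the target human voice
mel = model(phonemes, chars, h_spk)
print(mel.shape)                             # torch.Size([1, 20, 80])
```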
In addition, as shown in the figure, the decoding network part of the prosody model in this example is further provided with a length regulator (Length Regulator) and a linearization layer (Linear Layer). The Length Regulator is used to adjust the lengths of the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector so that their lengths are consistent with the speech spectrum information. The Linear Layer is used to linearize the output of the Decoder.
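The length-regulation step can be illustrated with the following sketch, which expands phoneme-level features to frame level according to per-phoneme durations; the interface and the duration source are assumptions of the example.

```python
# Sketch of a length regulator: repeat each phoneme-level vector by its frame duration.
import torch

def length_regulate(h, durations):
    """h: [T, D] phoneme-level features; durations: [T] number of frames per phoneme."""
    return torch.repeat_interleave(h, durations, dim=0)   # -> [sum(durations), D]

h = torch.randn(3, 4)
durations = torch.tensor([2, 1, 3])
print(length_regulate(h, durations).shape)  # torch.Size([6, 4])
```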
As can be seen from FIG. 2B, although both the Word Encoder and the LPV Predictor process the "Word" input, in order to make the "Word" better meet the needs of each part and to make the prosody model more flexible, in one optional manner the text encoding network includes a character encoding network and a context encoding network. The character encoding network, shown in the figure as the Word Encoder, is used to encode the text to be converted at the character level and generate the character text vector to be added to the phoneme vector. The context encoding network can be a network such as a BERT network or another network that can generate text vectors, and is used to encode the text to be converted at the character level and generate the character text vector to be input, together with the voiceprint feature vector, into the hidden prosody vector prediction network. As mentioned above, however, the two encoding networks can also adopt the same structure, which is equally applicable to the solution of the embodiments of the present application.
Based on the above prosody model, the data conversion method of this embodiment is exemplarily described below from the perspective of the speech synthesis process, as shown in FIG. 2C.
The speech synthesis process usually includes three parts: front-end processing, acoustic model processing, and vocoder processing. The front-end processing mainly obtains pronunciation and linguistic information from the text to be converted, including but not limited to text normalization and grapheme-to-phoneme conversion (such as converting text characters into pronunciation information like phonemes, so that the subsequent acoustic model can accurately obtain the pronunciation of the text characters), and so on.
The acoustic model processing part is mainly completed by an acoustic model, implemented in this example as the above prosody model, which generates acoustic features, such as a Mel spectrogram, based on the pronunciation information or linguistic information produced by the front-end processing. Specifically in this example, the prosody model outputs a Mel spectrogram based on the phonemes of the text to be converted, the character-level text, and the voiceprint features of the target human voice to be converted into. This process is as described above and is not repeated here.
The Mel spectrogram output by the prosody model is then input into a vocoder, which synthesizes the waveform of the final sound based on the Mel spectrogram, thereby completing the text-to-speech (TTS) conversion process.
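The embodiments do not prescribe a particular vocoder. As one illustrative substitute for a trained neural vocoder, a Mel spectrogram can be inverted to a waveform with the Griffin-Lim based inversion in librosa, as sketched below; the sampling rate and STFT parameters are assumptions.

```python
# Illustrative vocoder substitute: invert a Mel spectrogram to audio with Griffin-Lim (librosa).
# A trained neural vocoder would normally be used instead; this is only a stand-in for the sketch.
import numpy as np
import librosa
import soundfile as sf

sr = 22050
mel = np.abs(np.random.randn(80, 200))  # placeholder Mel spectrogram [n_mels, frames]
audio = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024, hop_length=256)
sf.write("synth.wav", audio, sr)
```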
In an example of a human-computer interaction scenario, the speech synthesis process includes: obtaining a response to a user instruction sent to a smart device, where the response contains a text to be replied for the user instruction; obtaining a phoneme vector and a text vector corresponding to the text to be replied and a voiceprint feature vector of a target human voice; obtaining a linguistic feature vector corresponding to the text to be replied according to the phoneme vector and the text vector; predicting a hidden prosody vector of the text to be replied according to the text vector and the voiceprint feature vector; generating speech spectrum information corresponding to the text to be replied according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector; and generating and playing the speech corresponding to the text to be replied according to the speech spectrum information.
In this example, a human-computer interaction scenario is assumed, the smart device is exemplified as a smart speaker, the user instruction is exemplified as a voice question uttered by a user, and the text to be replied is correspondingly the reply to the voice question. User X asks the smart speaker a voice question, "What is 1 plus 1?". After receiving the voice question, the smart speaker converts it into text and sends it to a server for a query, and obtains the query result "1 plus 1 equals 2" returned by the server. After receiving the query result, the smart speaker converts each character in the query result into phonemes to form a phoneme sequence. The smart speaker also has its own voiceprint features. Therefore, the smart speaker feeds the corresponding phonemes in the phoneme sequence, the characters, and the voiceprint features into the prosody model in character order, and the prosody model outputs a Mel spectrogram through the processing described above; the Mel spectrogram is then input into the vocoder, and the vocoder synthesizes the final speech for playback. In this way, the reply to the voice question of user X is realized.
In FIG. 2C, for ease of description, the prosody model and the vocoder are illustrated separately, but those skilled in the art should understand that, in practical applications, both the prosody model and the vocoder are provided in the smart speaker and are controlled and executed by corresponding components in the smart speaker, such as a processor.
In another example of a live-broadcast scenario, the speech synthesis process may include: obtaining the live script text corresponding to an object to be broadcast live; obtaining a phoneme vector and a text vector corresponding to the live script text and a voiceprint feature vector of a target human voice; obtaining a linguistic feature vector corresponding to the live script text according to the phoneme vector and the text vector; predicting a hidden prosody vector of the live script text according to the text vector and the voiceprint feature vector; generating speech spectrum information corresponding to the live script text according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector; and generating live speech corresponding to the live script text according to the speech spectrum information.
The live script corresponding to the live-broadcast object may be a live script corresponding to multiple live-broadcast objects (such as commodities, content, or programs), for example the script of an entire live broadcast, or a live script corresponding to one or some of multiple live-broadcast objects. After the live script is obtained, the method described above can be used to finally convert the live script into live speech for application in live-broadcast scenarios, such as live-streaming sales or live content promotion. The live speech can be adapted to a virtual host or to a real human host, and is widely applicable in live-broadcast scenarios.
In yet another performance scenario, the speech synthesis process may include: obtaining the script text to be performed; obtaining a phoneme vector and a text vector corresponding to the script text and a voiceprint feature vector of a target human voice; obtaining a linguistic feature vector corresponding to the script text according to the phoneme vector and the text vector; predicting a hidden prosody vector of the script text according to the text vector and the voiceprint feature vector; generating speech spectrum information corresponding to the script text according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector; and generating performance speech corresponding to the script text according to the speech spectrum information.
The script text to be performed includes one of the following: a line script corresponding to audio or video, or the text content of an e-book. After the script text is obtained, the method described above can be used to finally convert the script text into performance speech for application in performance scenarios. For example, the performance speech can be used to dub video characters, to generate audio, or to realize audio e-books, and so on.
It can be seen that, with this embodiment, when acquiring the spectrum of the text to be converted into speech, the phonemes and text of the text to be converted and the voiceprint features of the target human voice are considered together. The linguistic features of the text to be converted can be obtained based on the phonemes and the text; these features carry pronunciation features at the level corresponding to the text (such as the character level, word level, or sentence level). The hidden prosody vector of the text to be converted can be predicted based on the text and the voiceprint features; this vector mainly contains prosody information, and the prosody obtained in this way is derived from features corresponding to the text and focuses more on the characteristics of the prosody itself. The speech spectrum information finally obtained by processing the linguistic features, the hidden prosody vector, and the voiceprint features better fits the speech characteristics of the target human voice corresponding to the actual voiceprint features and is closer to the actual prosody of the target human voice. As a result, the speech subsequently generated based on the obtained speech spectrum information is also closer to an actual human voice.
It can be seen that, with the solution of this embodiment, on the one hand, prosody modeling is no longer based on the fundamental frequency; instead, prosody information is extracted from multiple kinds of information related to prosody, which makes the extracted prosody more accurate. On the other hand, the relationships among the various factors that affect prosody (such as the phonemes, the text, and the voiceprint of the target human voice) are considered together, which also makes the resulting prosody more accurate.
Embodiment 2
Referring to FIG. 3A, a flowchart of the steps of a data conversion method according to Embodiment 2 of the present application is shown.
This embodiment takes data conversion using a prosody model as an example; the training process of the prosody model is introduced first, and data conversion is then performed based on the trained prosody model.
The data conversion method of this embodiment includes the following steps:
Step S302: Obtain training samples, and use the training samples to train the prosody model.
The training samples include text samples to be converted, corresponding speech samples, and voiceprint feature sample vectors. In the embodiments of the present application, low-frequency-band speech samples are used as the speech samples, for example speech samples in the 0-2 kHz band. On the one hand, low-frequency-band speech samples carry sufficient prosody-related information, so the training effect is not affected; on the other hand, removing the speech outside the low-frequency band keeps the model structure relatively simple. It should be noted, however, that full-band speech samples are equally applicable to the solution of the embodiments of the present application. In addition, low-quality speech samples containing noise can also be used instead of being limited to high-quality speech samples; in this way, audio in videos, ordinary audio, broadcast audio, and the like can all serve as speech samples in the embodiments of the present application, which greatly enriches the number and selection range of speech samples and reduces the cost of obtaining them.
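A sketch of extracting low-frequency-band acoustic features for such training samples is given below; using librosa and limiting the Mel filterbank to 0-2 kHz via fmax is only one possible way to realize this and is an assumption of the example.

```python
# Sketch: low-band (0-2 kHz) Mel spectrogram of a training speech sample.
# Assumption: restricting the Mel filterbank with fmax=2000 stands in for using low-frequency-band speech.
import librosa

wav, sr = librosa.load("sample.wav", sr=22050)   # "sample.wav" is a placeholder path
mel_low = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024, hop_length=256,
                                         n_mels=80, fmin=0, fmax=2000)
print(mel_low.shape)  # (80, n_frames)
```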
In this embodiment, the prosody model is shown in FIG. 3B and includes: a phoneme encoding network (Phoneme Encoder in the figure), a text encoding network, a prosody encoding network (Prosody Encoder in the figure), a hidden prosody vector prediction network (LPV Predictor in the figure), a vector concatenation layer (the dashed box containing the "+" sign in the figure), and a decoding network (the dashed box containing the Decoder in the figure).
Based on this structure, training the prosody model includes: inputting the phonemes corresponding to the text samples to be converted into the phoneme encoding network to obtain corresponding phoneme sample vectors; inputting the characters of the text samples to be converted into the text encoding network to obtain corresponding character sample text vectors; inputting the speech samples, the phoneme sample vectors, the character sample text vectors, and the voiceprint feature sample vectors into the prosody encoding network to obtain corresponding first hidden prosody sample vectors (the LPV shown in the prosody model of FIG. 3B); and training the prosody model based on the phoneme sample vectors, the character sample text vectors, the voiceprint feature sample vectors, and the first hidden prosody sample vectors.
To make the model more flexible, the text encoding network is divided into a character encoding network (the character-level Word Encoder in the figure) and a context encoding network (the Context Encoder in the upper right corner of the figure). On this basis, inputting the characters of the text to be converted into the text encoding network to obtain the corresponding character sample text vectors may be implemented as: inputting the characters of the text samples to be converted into the character encoding network and the context encoding network respectively, to obtain corresponding first character sample text vectors and second character sample text vectors. Correspondingly, inputting the speech samples, the phoneme sample vectors, the character sample text vectors, and the voiceprint feature sample vectors into the prosody encoding network to obtain the corresponding first hidden prosody sample vectors may be implemented as: inputting the speech samples, the phoneme sample vectors, the first character sample text vectors, and the voiceprint feature sample vectors into the prosody encoding network to obtain the corresponding first hidden prosody sample vectors.
In addition, in this embodiment, the decoding network is provided with a Length Regulator layer and a Linear Layer besides the Decoder. The Length Regulator adjusts the lengths of the linguistic feature sample vectors, the first hidden prosody sample vectors, and the voiceprint feature sample vectors so that their lengths are consistent with the speech spectrum information. The Linear Layer linearizes the output of the Decoder.
With this structure, training of the portion inside the L-shaped dashed box on the left of FIG. 3B includes: converting the text sequence of the input text sample to be converted into a phoneme sequence (Phoneme in the figure) and a character sequence (Word in the figure), which are fed into the phoneme encoding network Phoneme Encoder and the character encoding network Word Encoder respectively. The Phoneme Encoder produces the phoneme sample vector Phoneme Embedding, and the Word Encoder produces the first character sample text vector Word Embedding. The Phoneme Embedding and the Word Embedding are then summed to obtain the linguistic feature sample vector H_ling. Next, based on H_ling and H_spk (the voiceprint feature sample vector), the mel spectrogram (mel-spec) of the sample voice, i.e. the low-frequency part of the speech sample (such as the 0-2 kHz part), is passed through the prosody encoding network Prosody Encoder to obtain the first hidden prosody sample vectors (latent prosody vectors, LPV). Finally, H_ling, H_spk, and the first hidden prosody sample vectors are concatenated and fed into the subsequent decoding network to obtain the predicted mel spectrogram.
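Purely as a hedged sketch, and not as the embodiment's actual implementation, the training forward pass described above may be pictured with the following PyTorch-style structure; the sub-module names (phoneme_encoder, word_encoder, prosody_encoder, length_regulator, decoder, linear) are assumed placeholders, and the alignment of phoneme-level, character-level, frame-level and speaker-level sequences is glossed over.

```python
import torch

def training_forward(model, phonemes, words, mel_low, h_spk):
    """One illustrative training forward pass mirroring the described flow.

    `model` is assumed to expose phoneme_encoder, word_encoder,
    prosody_encoder, length_regulator, decoder and linear sub-modules;
    all sequences are assumed to be pre-aligned, and h_spk is assumed
    to be broadcast to the sequence length where needed.
    """
    ph_emb = model.phoneme_encoder(phonemes)    # Phoneme Embedding
    word_emb = model.word_encoder(words)        # first character sample text vector
    h_ling = ph_emb + word_emb                  # linguistic feature vector H_ling (sum)
    # First hidden prosody sample vectors from the low-band mel spectrogram.
    lpv = model.prosody_encoder(mel_low, ph_emb, word_emb, h_spk)
    x = torch.cat([h_ling, lpv, h_spk], dim=-1) # concatenation layer
    x = model.length_regulator(x)               # match spectrogram length (durations omitted)
    return model.linear(model.decoder(x))       # predicted mel spectrogram
```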
In this embodiment, the training process of the prosody encoding network Prosody Encoder may be implemented, by way of example, as follows: performing feature extraction on the speech sample through a first convolutional layer of the prosody encoding network based on the phoneme sample vector and the voiceprint feature sample vector, to obtain first prosody sample features; performing character-level pooling on the first prosody sample features through a pooling layer of the prosody encoding network, to obtain character-level prosody sample features; performing feature extraction on the character-level prosody sample features through a second convolutional layer of the prosody encoding network based on the first character sample text vector and the voiceprint feature sample vector, to obtain second prosody sample features; and vectorizing the second prosody sample features through a vectorization layer of the prosody encoding network, to obtain the first hidden prosody sample vector. In this way, the structure of the prosody encoding network is simplified while the hidden prosody sample vectors can still be extracted effectively.
By way of example, as shown in part (b) of FIG. 3B, the inputs of the prosody encoding network Prosody Encoder are the low-frequency part of the mel spectrogram of the speech sample corresponding to the text sample to be converted, the Phoneme Embedding and Word Embedding (for brevity, simply denoted H_ling in the text), and H_spk; the output is the character-level sequence of first hidden prosody sample vectors. The Prosody Encoder contains two levels of Conv Stacks (convolution stacks). The first-level Conv Stacks process the low-frequency part of the mel spectrogram and additionally receive the Phoneme Embedding and H_spk; by adding the Phoneme Embedding, the convolution over the low-frequency part of the mel spectrogram can filter out the influence of phonemes on prosody. The convolved low-frequency mel spectrogram is then compressed to the character level by the pooling operation of the character-level pooling layer Word-level Pooling. The second-level Conv Stacks obtain a hidden prosody representation based on the output of the first-level Conv Stacks, the Word Embedding, and H_spk; by adding the Word Embedding, the convolution can filter out the influence of character semantics on prosody. Finally, based on this hidden prosody representation, the character-level sequence of first hidden prosody sample vectors is obtained through the vector quantization layer (Vector Quantization).
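A minimal sketch of the two-stage Prosody Encoder described above is given below, assuming PyTorch; the ConvStack-like sub-modules, the word-level pooling and the vector quantizer are passed in as placeholders, and the conditioning inputs are assumed to have already been aligned to the appropriate sequence lengths.

```python
import torch
import torch.nn as nn

class ProsodyEncoderSketch(nn.Module):
    """Hedged sketch of the two-stage prosody encoder described above.

    conv_stack1, conv_stack2, word_pool and quantizer are placeholder
    modules standing in for the Conv Stacks, the word-level pooling
    layer and the vector quantization layer of FIG. 3B(b).
    """
    def __init__(self, conv_stack1, conv_stack2, word_pool, quantizer):
        super().__init__()
        self.conv_stack1 = conv_stack1
        self.conv_stack2 = conv_stack2
        self.word_pool = word_pool
        self.quantizer = quantizer

    def forward(self, mel_low, ph_emb, word_emb, h_spk, word_boundaries):
        # Stage 1: frame-level convolution, conditioned on phoneme
        # embeddings and the speaker vector to factor phonemes out.
        x = self.conv_stack1(torch.cat([mel_low, ph_emb, h_spk], dim=-1))
        # Pool frame-level features to the character (word) level.
        x = self.word_pool(x, word_boundaries)
        # Stage 2: word-level convolution, conditioned on word embeddings
        # and the speaker vector to factor character semantics out.
        x = self.conv_stack2(torch.cat([x, word_emb, h_spk], dim=-1))
        # Vector quantization yields the character-level hidden prosody vectors.
        return self.quantizer(x)
```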
After the first hidden prosody sample vectors are obtained, the prosody model can be trained based on the phoneme sample vectors, the first character sample text vectors, the voiceprint feature sample vectors, and the first hidden prosody sample vectors. Specifically, this may include: summing the phoneme sample vector and the first character sample text vector through the vector concatenation layer to obtain the linguistic feature vector; concatenating the linguistic feature vector, the voiceprint feature sample vector, and the first hidden prosody sample vector to obtain a concatenated sample vector; and decoding the concatenated sample vector through the decoding network and training the prosody model according to the decoding result.
In an optional solution, before the concatenated sample vector is decoded by the decoding network, length regularization may also be performed on the concatenated sample vector by the length regulator layer; the length-regularized concatenated sample vector is then decoded by the decoding network, as shown in part (a) of FIG. 3B.
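For illustration only, the length regularization can be sketched as repeating each token-level vector by an integer duration so that the sequence length matches the target spectrogram; the source of the durations is an assumption here, since the figure does not detail a duration model.

```python
import torch

def length_regulate(x, durations):
    """Hedged sketch of a length regulator.

    x: (tokens, dim) hidden vectors; durations: (tokens,) integer tensor
    giving how many spectrogram frames each token should cover.
    """
    return torch.repeat_interleave(x, durations, dim=0)
```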
In addition, the prosody encoding network Prosody Encoder not only participates in the training of the portion in the L-shaped dashed box on the left of FIG. 3B(a), but also supports the training of the hidden prosody vector prediction network LPV Predictor. In the inference stage of the prosody model, prosody prediction is carried out mainly by the LPV Predictor, and the Prosody Encoder no longer functions. Therefore, training the prosody model further includes: inputting the second character sample text vector and the voiceprint feature sample vector into the hidden prosody vector prediction network to predict a second hidden prosody sample vector; and training the hidden prosody vector prediction network according to the difference between the first hidden prosody sample vector and the second hidden prosody sample vector.
As described above, the second character sample text vector may be obtained by the context encoding network Context Encoder shown in the upper right corner of FIG. 3B, whose specific structure may adopt a BERT model structure. Those skilled in the art should understand, however, that other structures, such as any text-only pretrained model structure, are equally applicable to the solutions of the embodiments of the present application.
A simple illustration of training the hidden prosody vector prediction network is shown in the lower right corner of FIG. 3B. As can be seen, the prosody encoding network Prosody Encoder outputs the first hidden prosody sample vector based on the low-frequency part of the mel spectrogram of the speech sample (i.e., the noisy audio shown in the lower right corner of FIG. 3B), the Phoneme Embedding and Word Embedding (simply denoted H_ling for brevity), and H_spk. The LPV Predictor outputs the second hidden prosody sample vector based on the character sequence of the text to be converted (Word in the figure) and H_spk. In part (d) of FIG. 3B, both hidden prosody sample vectors are denoted LPV. Based on these two LPVs and a preset loss function, the LPV Predictor can be trained. The loss function may be any appropriate function, including but not limited to a distance function such as a cosine distance function, which is not limited in the embodiments of the present application.
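As a hedged example of the training objective just described, a cosine-distance loss between the two LPVs might look as follows; cosine distance is only one of the admissible loss functions mentioned above, and stopping gradients into the encoder target is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F

def lpv_predictor_loss(lpv_target, lpv_pred):
    """Cosine-distance loss between encoder LPVs (targets) and predicted LPVs."""
    cos = F.cosine_similarity(lpv_pred, lpv_target.detach(), dim=-1)
    return (1.0 - cos).mean()
```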
The LPV Predictor is an autoregressive prediction model. As can be seen from part (c) of FIG. 3B, on the one hand, the LPV Predictor converts the input Word into character vectors through the Context Encoder; to distinguish them from the Word Embedding output by the aforementioned Word Encoder, the character vectors output by the Context Encoder are denoted Hi. On the other hand, when processing the current character, the LPV Predictor also takes the LPV corresponding to the previous character (LPV i-1 in the figure) as a reference. After self-attention is computed over LPV i-1, the result is concatenated with Hi and H_spk, and the concatenated vector is then further processed (e.g., normalization, convolution). The normalization layer may, for example, be expressed as add&norm, where add denotes residual processing and norm denotes normalization, and the convolutional layer may, for example, be a Conv1D (one-dimensional convolution). The prosody prediction result for the current character, i.e., LPV i, is finally obtained. That is, the prediction process may be implemented as: inputting the second character sample text vector and the voiceprint feature sample vector corresponding to the current character to be predicted into the hidden prosody vector prediction network; performing feature fusion on the second character sample text vector corresponding to the current character, the voiceprint feature sample vector, and the second hidden prosody sample vector corresponding to the character preceding the current character; and predicting the second hidden prosody sample vector of the current character based on the fused feature vector. More accurate prosody information can be obtained through this autoregressive approach.
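Under the assumption of standard PyTorch building blocks, and with all layer sizes chosen arbitrarily for illustration, one autoregressive step can be sketched as follows; the actual LPV Predictor of FIG. 3B(c) is not limited to this arrangement.

```python
import torch
import torch.nn as nn

class LPVPredictorStepSketch(nn.Module):
    """Hedged sketch of one autoregressive prediction step.

    self_attn, proj and conv stand in for the self-attention,
    add&norm and Conv1D blocks in the figure; dimensions are
    illustrative only.
    """
    def __init__(self, dim):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(3 * dim, dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)

    def forward(self, lpv_prev, h_i, h_spk):
        # Self-attention over the previously predicted LPVs
        # (in practice the whole prefix would attend causally).
        attn_out, _ = self.self_attn(lpv_prev, lpv_prev, lpv_prev)
        attn_out = self.norm(lpv_prev + attn_out)        # add & norm (residual)
        fused = self.proj(torch.cat([attn_out, h_i, h_spk], dim=-1))
        # Conv1d expects (batch, channels, time).
        lpv_i = self.conv(fused.transpose(1, 2)).transpose(1, 2)
        return lpv_i
```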
Through the above process, each part of the prosody model in this embodiment can be trained. After training is completed, data conversion from text to spectrum can be performed.
Step S304: obtain the phoneme vector and text vector corresponding to the text to be converted and the voiceprint feature vector of the target voice.
For example, with the trained prosody model of FIG. 3B, the phoneme sequence of the text to be converted is encoded into the phoneme vector Phoneme Embedding through its phoneme encoding network Phoneme Encoder, and the character sequence of the text to be converted is converted into the character text vector Word Embedding through its character encoding network Word Encoder. The voiceprint feature vector H_spk of the target voice can be obtained in advance; the specific means of extracting the voiceprint feature vector from the target voice is not limited in the embodiments of the present application.
Step S306: obtain the linguistic feature vector corresponding to the text to be converted according to the phoneme vector and the text vector; predict the hidden prosody vector of the text to be converted according to the text vector and the voiceprint feature vector.
For example, with the trained prosody model of FIG. 3B, the Phoneme Embedding and Word Embedding are summed through the vector concatenation layer to obtain the linguistic feature vector H_ling, and the hidden prosody vector LPV of the text to be converted is obtained through the LPV Predictor.
Step S308: generate the speech spectrum information corresponding to the text to be converted according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector.
For example, with the trained prosody model of FIG. 3B, H_ling, LPV, and H_spk are concatenated through the vector concatenation layer. The result then passes through the Length Regulator, the Decoder, and the Linear Layer of the decoding network in turn for decoding-related processing, and the speech spectrum information corresponding to the text to be converted is finally obtained.
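Putting steps S304-S308 together, an inference-time sketch, assuming the same placeholder sub-module names as in the training sketch above plus context_encoder and lpv_predictor, could look like this; note that at inference the LPV comes from the predictor rather than from the Prosody Encoder, since no reference speech exists for the text to be converted.

```python
import torch

def text_to_spectrogram(model, phonemes, words, h_spk):
    """Hedged inference sketch for steps S304-S308."""
    ph_emb = model.phoneme_encoder(phonemes)
    word_emb = model.word_encoder(words)
    h_ling = ph_emb + word_emb                  # linguistic feature vector
    h_ctx = model.context_encoder(words)        # character vectors for the predictor
    lpv = model.lpv_predictor(h_ctx, h_spk)     # predicted hidden prosody vectors
    x = torch.cat([h_ling, lpv, h_spk], dim=-1)
    x = model.length_regulator(x)
    return model.linear(model.decoder(x))       # speech spectrum information
```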
Further, on the basis of the obtained speech spectrum information, the corresponding speech can be output through a vocoder, realizing the conversion from text to speech.
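The embodiment does not prescribe a particular vocoder. Purely as a stand-in illustration, a Griffin-Lim reconstruction via librosa is shown below; in practice a neural vocoder would typically be used instead, and the assumption here is that the spectrum is a log-power mel spectrogram with the stated frame parameters.

```python
import librosa
import numpy as np

def mel_to_audio(mel_log, sr=16000, n_fft=1024, hop_length=256):
    """Illustrative vocoder step: invert a log-power mel spectrogram."""
    mel = np.exp(mel_log)                        # undo the log compression
    return librosa.feature.inverse.mel_to_audio(
        mel, sr=sr, n_fft=n_fft, hop_length=hop_length)
```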
It should be noted that the descriptions of the above steps S304-S308 are relatively brief; for relevant parts, reference may be made to the related descriptions in the foregoing Embodiment 1 and step S302.
In this embodiment, hidden prosody vectors, rather than prosodic components, are used to characterize prosody. This avoids the problems of the traditional approach, in which inaccurate fundamental-frequency extraction and the lack of correlation among the predictions for individual prosodic components lead to poor prosody modeling, poor spectral results, and in turn poor speech synthesis. With the solution of this embodiment, on the one hand, prosody modeling is no longer based on the fundamental frequency; instead, prosody information is extracted from multiple kinds of prosody-related information, so that the extracted prosody is more accurate. On the other hand, the relationships among the multiple factors affecting prosody (such as phonemes, text, and the voiceprint of the target voice) are considered comprehensively, which also makes the obtained prosody more accurate.
Embodiment 3
Referring to FIG. 4, a schematic structural diagram of an electronic device according to Embodiment 3 of the present application is shown. The specific embodiments of the present application do not limit the specific implementation of the electronic device.
As shown in FIG. 4, the electronic device may include: a processor 402, a communications interface 404, a memory 406, and a communication bus 408.
Wherein:
the processor 402, the communication interface 404, and the memory 406 communicate with one another through the communication bus 408;
the communication interface 404 is used for communicating with other electronic devices or servers;
the processor 402 is configured to execute a program 410, and may specifically perform the relevant steps in the foregoing data conversion method embodiments.
Specifically, the program 410 may include program code, and the program code includes computer operation instructions.
The processor 402 may be a CPU, an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application. The one or more processors included in the smart device may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs.
The memory 406 is used to store the program 410. The memory 406 may include a high-speed RAM memory, and may also include a non-volatile memory, such as at least one disk memory.
The program 410 may specifically be used to cause the processor 402 to perform any of the operations described in the foregoing data conversion method embodiments.
For the specific implementation of the steps in the program 410, reference may be made to the corresponding descriptions of the corresponding steps and units in the related method embodiments among the foregoing data conversion method embodiments, which have corresponding beneficial effects and are not repeated here. Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the devices and modules described above, reference may be made to the corresponding process descriptions in the foregoing method embodiments, which are not repeated here.
In addition, an embodiment of the present application further provides a data conversion apparatus, including:
an acquisition module, configured to acquire a phoneme vector and a text vector corresponding to text to be converted and a voiceprint feature vector of a target voice;
the acquisition module is further configured to obtain a linguistic feature vector corresponding to the text to be converted according to the phoneme vector and the text vector;
a prediction module, configured to predict a hidden prosody vector of the text to be converted according to the text vector and the voiceprint feature vector; and
a generation module, configured to generate speech spectrum information corresponding to the text to be converted according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector.
In a possible design, the text vector is a character text vector corresponding to each character in the text to be converted.
In a possible design, the data conversion method is performed by a prosody model, and the prosody model includes at least: a phoneme encoding network, a text encoding network, a hidden prosody vector prediction network, a vector concatenation layer, and a decoding network;
the phoneme encoding network is configured to acquire the phoneme vector corresponding to the text to be converted;
the text encoding network is configured to acquire the text vector corresponding to the text to be converted;
the hidden prosody vector prediction network is configured to predict the hidden prosody vector of the text to be converted according to the text vector corresponding to the text to be converted and the acquired voiceprint feature vector of the target voice;
the vector concatenation layer is configured to sum the phoneme vector and the text vector to obtain the linguistic feature vector corresponding to the text to be converted, and to concatenate the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector to generate a concatenated vector;
the decoding network is configured to decode the concatenated vector to obtain the speech spectrum information corresponding to the text to be converted.
In a possible design, the text encoding network includes a character encoding network and a context encoding network;
the character encoding network is configured to perform character-level encoding on the text to be converted, and to generate a character text vector to be summed with the phoneme vector;
the context encoding network is configured to perform character-level encoding on the text to be converted, and to generate a character text vector to be input, together with the voiceprint feature vector, into the hidden prosody vector prediction network.
In a possible design, the acquisition module is further configured to:
acquire training samples, where the training samples include text samples to be converted, corresponding speech samples, and voiceprint feature sample vectors, and the speech samples are speech samples in the 0-2 kHz frequency band;
the apparatus further includes a processing module;
the processing module is configured to train the prosody model using the training samples.
In a possible design, the prosody model further includes a prosody encoding network;
the processing module is specifically configured to:
input the phonemes corresponding to the text sample to be converted into the phoneme encoding network to obtain a corresponding phoneme sample vector, and input the characters of the text sample to be converted into the text encoding network to obtain a corresponding character sample text vector;
input the speech sample, the phoneme sample vector, the character sample text vector, and the voiceprint feature sample vector into the prosody encoding network to obtain a corresponding first hidden prosody sample vector; and
train the prosody model based on the phoneme sample vector, the character sample text vector, the voiceprint feature sample vector, and the first hidden prosody sample vector.
In a possible design, the processing module is specifically configured to:
input the characters of the text to be converted into the character encoding network and the context encoding network respectively, to obtain a corresponding first character sample text vector and a corresponding second character sample text vector;
where inputting the speech sample, the phoneme sample vector, the character sample text vector, and the voiceprint feature sample vector into the prosody encoding network to obtain the corresponding first hidden prosody sample vector includes: inputting the speech sample, the phoneme sample vector, the first character sample text vector, and the voiceprint feature sample vector into the prosody encoding network to obtain the corresponding first hidden prosody sample vector.
In a possible design, the processing module is specifically configured to:
perform feature extraction on the speech sample through a first convolutional layer of the prosody encoding network based on the phoneme sample vector and the voiceprint feature sample vector, to obtain first prosody sample features;
perform character-level pooling on the first prosody sample features through a pooling layer of the prosody encoding network, to obtain character-level prosody sample features;
perform feature extraction on the character-level prosody sample features through a second convolutional layer of the prosody encoding network based on the first character sample text vector and the voiceprint feature sample vector, to obtain second prosody sample features; and
vectorize the second prosody sample features through a vectorization layer of the prosody encoding network, to obtain the first hidden prosody sample vector.
In a possible design, the processing module is specifically configured to:
input the second character sample text vector and the voiceprint feature sample vector into the hidden prosody vector prediction network to predict a second hidden prosody sample vector; and
train the hidden prosody vector prediction network according to the difference between the first hidden prosody sample vector and the second hidden prosody sample vector.
In addition, an embodiment of the present application further provides a data conversion apparatus, including:
an acquisition module, configured to acquire a response to a user instruction sent to a smart device, where the response contains text to be replied for the user instruction;
the acquisition module is further configured to acquire a phoneme vector and a text vector corresponding to the text to be replied and a voiceprint feature vector of a target voice;
the acquisition module is further configured to obtain a linguistic feature vector corresponding to the text to be replied according to the phoneme vector and the text vector;
a prediction module, configured to predict a hidden prosody vector of the text to be replied according to the text vector and the voiceprint feature vector;
a generation module, configured to generate speech spectrum information corresponding to the text to be replied according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector; and
a processing module, configured to generate and play the speech corresponding to the text to be replied according to the speech spectrum information.
In addition, an embodiment of the present application further provides a data conversion apparatus, including:
an acquisition module, configured to acquire live-streaming script text corresponding to an object to be live-streamed;
the acquisition module is further configured to acquire a phoneme vector and a text vector corresponding to the live-streaming script text and a voiceprint feature vector of a target voice;
the acquisition module is further configured to obtain a linguistic feature vector corresponding to the live-streaming script text according to the phoneme vector and the text vector;
a prediction module, configured to predict a hidden prosody vector of the live-streaming script text according to the text vector and the voiceprint feature vector;
a generation module, configured to generate speech spectrum information corresponding to the live-streaming script text according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector; and
a processing module, configured to generate the live-streaming speech corresponding to the live-streaming script text according to the speech spectrum information.
In addition, an embodiment of the present application further provides a data conversion apparatus, including:
an acquisition module, configured to acquire script text to be performed, where the script text to be performed includes one of the following: a line script corresponding to audio or video, or e-book text content;
the acquisition module is further configured to acquire a phoneme vector and a text vector corresponding to the script text and a voiceprint feature vector of a target voice;
the acquisition module is further configured to obtain a linguistic feature vector corresponding to the script text according to the phoneme vector and the text vector;
a prediction module, configured to predict a hidden prosody vector of the script text according to the text vector and the voiceprint feature vector;
a generation module, configured to generate speech spectrum information corresponding to the script text according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector; and
a processing module, configured to generate the performance speech corresponding to the script text according to the speech spectrum information.
In addition, an embodiment of the present application further provides a data conversion apparatus, including:
an acquisition module, configured to acquire a phoneme vector corresponding to text to be converted through the phoneme encoding network of a prosody model, and to acquire a text vector corresponding to the text to be converted through the text encoding network of the prosody model;
a prediction module, configured to predict a hidden prosody vector of the text to be converted through the hidden prosody vector prediction network of the prosody model according to the text vector corresponding to the text to be converted and an acquired voiceprint feature vector of a target voice;
the acquisition module is further configured to sum the phoneme vector and the text vector through the vector concatenation layer of the prosody model, to obtain a linguistic feature vector corresponding to the text to be converted;
a generation module, configured to concatenate the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector to generate a concatenated vector; and
a processing module, configured to decode the concatenated vector through the decoding network of the prosody model, to obtain speech spectrum information corresponding to the text to be converted.
An embodiment of the present application further provides a computer program product, including computer instructions, where the computer instructions instruct a computing device to perform the operations corresponding to any data conversion method in the foregoing method embodiments.
It should be noted that in the embodiments of the present application, the mel spectrogram is taken as an example of the input of the prosody encoding network, but the input is not limited thereto; other acoustic features (such as LPC features, MFCC, fbank, raw waveform, etc.) are equally applicable.
It should be pointed out that, according to implementation needs, each component/step described in the embodiments of the present application may be split into more components/steps, and two or more components/steps, or partial operations of components/steps, may also be combined into new components/steps to achieve the purposes of the embodiments of the present application.
The above methods according to the embodiments of the present application may be implemented in hardware or firmware, or implemented as software or computer code that can be stored in a recording medium (such as a CD-ROM, RAM, floppy disk, hard disk, or magneto-optical disk), or implemented as computer code that is downloaded over a network, originally stored in a remote recording medium or a non-transitory machine-readable medium, and to be stored in a local recording medium, so that the methods described herein can be processed by such software stored on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware (such as an ASIC or FPGA). It can be understood that a computer, a processor, a microprocessor controller, or programmable hardware includes a storage component (for example, RAM, ROM, flash memory, etc.) that can store or receive software or computer code; when the software or computer code is accessed and executed by the computer, processor, or hardware, the data conversion methods described herein are implemented. In addition, when a general-purpose computer accesses code for implementing the data conversion methods shown herein, execution of the code converts the general-purpose computer into a special-purpose computer for performing the data conversion methods shown herein.
Those of ordinary skill in the art may realize that the units and method steps of the examples described in conjunction with the embodiments disclosed herein can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be considered as going beyond the scope of the embodiments of the present application.
The above implementations are only used to illustrate the embodiments of the present application, and are not intended to limit them. Those of ordinary skill in the relevant technical field may also make various changes and modifications without departing from the spirit and scope of the embodiments of the present application; therefore, all equivalent technical solutions also fall within the scope of the embodiments of the present application, and the patent protection scope of the embodiments of the present application shall be defined by the claims.

Claims (17)

1. A data conversion method, comprising:
acquiring a phoneme vector and a text vector corresponding to text to be converted and a voiceprint feature vector of a target voice;
obtaining a linguistic feature vector corresponding to the text to be converted according to the phoneme vector and the text vector, and predicting a hidden prosody vector of the text to be converted according to the text vector and the voiceprint feature vector; and
generating speech spectrum information corresponding to the text to be converted according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector.
2. The method according to claim 1, wherein the text vector is a character text vector corresponding to each character in the text to be converted.
3. The method according to claim 1 or 2, wherein the data conversion method is performed by a prosody model, and the prosody model comprises at least: a phoneme encoding network, a text encoding network, a hidden prosody vector prediction network, a vector concatenation layer, and a decoding network;
the phoneme encoding network is configured to acquire the phoneme vector corresponding to the text to be converted;
the text encoding network is configured to acquire the text vector corresponding to the text to be converted;
the hidden prosody vector prediction network is configured to predict the hidden prosody vector of the text to be converted according to the text vector corresponding to the text to be converted and the acquired voiceprint feature vector of the target voice;
the vector concatenation layer is configured to sum the phoneme vector and the text vector to obtain the linguistic feature vector corresponding to the text to be converted, and to concatenate the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector to generate a concatenated vector; and
the decoding network is configured to decode the concatenated vector to obtain the speech spectrum information corresponding to the text to be converted.
4. The method according to claim 3, wherein the text encoding network comprises a character encoding network and a context encoding network;
the character encoding network is configured to perform character-level encoding on the text to be converted, and to generate a character text vector to be summed with the phoneme vector; and
the context encoding network is configured to perform character-level encoding on the text to be converted, and to generate a character text vector to be input, together with the voiceprint feature vector, into the hidden prosody vector prediction network.
5. The method according to claim 4, wherein the method further comprises:
acquiring training samples, wherein the training samples comprise text samples to be converted, corresponding speech samples, and voiceprint feature sample vectors, and the speech samples are speech samples in the 0-2 kHz frequency band; and
training the prosody model using the training samples.
6. The method according to claim 5, wherein the prosody model further comprises a prosody encoding network;
the training the prosody model using the training samples comprises:
inputting the phonemes corresponding to the text sample to be converted into the phoneme encoding network to obtain a corresponding phoneme sample vector, and inputting the characters of the text sample to be converted into the text encoding network to obtain a corresponding character sample text vector;
inputting the speech sample, the phoneme sample vector, the character sample text vector, and the voiceprint feature sample vector into the prosody encoding network to obtain a corresponding first hidden prosody sample vector; and
training the prosody model based on the phoneme sample vector, the character sample text vector, the voiceprint feature sample vector, and the first hidden prosody sample vector.
7. The method according to claim 6, wherein the inputting the characters of the text sample to be converted into the text encoding network to obtain the corresponding character sample text vector comprises:
inputting the characters of the text to be converted into the character encoding network and the context encoding network respectively, to obtain a corresponding first character sample text vector and a corresponding second character sample text vector; and
the inputting the speech sample, the phoneme sample vector, the character sample text vector, and the voiceprint feature sample vector into the prosody encoding network to obtain the corresponding first hidden prosody sample vector comprises: inputting the speech sample, the phoneme sample vector, the first character sample text vector, and the voiceprint feature sample vector into the prosody encoding network to obtain the corresponding first hidden prosody sample vector.
8. The method according to claim 7, wherein the inputting the speech sample, the phoneme sample vector, the first character sample text vector, and the voiceprint feature sample vector into the prosody encoding network to obtain the corresponding first hidden prosody sample vector comprises:
performing feature extraction on the speech sample through a first convolutional layer of the prosody encoding network based on the phoneme sample vector and the voiceprint feature sample vector, to obtain first prosody sample features;
performing character-level pooling on the first prosody sample features through a pooling layer of the prosody encoding network, to obtain character-level prosody sample features;
performing feature extraction on the character-level prosody sample features through a second convolutional layer of the prosody encoding network based on the first character sample text vector and the voiceprint feature sample vector, to obtain second prosody sample features; and
vectorizing the second prosody sample features through a vectorization layer of the prosody encoding network, to obtain the first hidden prosody sample vector.
9. The method according to claim 7, wherein the training the prosody model based on the phoneme sample vector, the character sample text vector, the voiceprint feature sample vector, and the first hidden prosody sample vector comprises:
inputting the second character sample text vector and the voiceprint feature sample vector into the hidden prosody vector prediction network, to predict a second hidden prosody sample vector; and
training the hidden prosody vector prediction network according to the difference between the first hidden prosody sample vector and the second hidden prosody sample vector.
10. A data conversion method, comprising:
acquiring a response to a user instruction sent to a smart device, wherein the response contains text to be replied for the user instruction;
acquiring a phoneme vector and a text vector corresponding to the text to be replied and a voiceprint feature vector of a target voice;
obtaining a linguistic feature vector corresponding to the text to be replied according to the phoneme vector and the text vector, and predicting a hidden prosody vector of the text to be replied according to the text vector and the voiceprint feature vector;
generating speech spectrum information corresponding to the text to be replied according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector; and
generating and playing the speech corresponding to the text to be replied according to the speech spectrum information.
11. A data conversion method, comprising:
acquiring live-streaming script text corresponding to an object to be live-streamed;
acquiring a phoneme vector and a text vector corresponding to the live-streaming script text and a voiceprint feature vector of a target voice;
obtaining a linguistic feature vector corresponding to the live-streaming script text according to the phoneme vector and the text vector, and predicting a hidden prosody vector of the live-streaming script text according to the text vector and the voiceprint feature vector;
generating speech spectrum information corresponding to the live-streaming script text according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector; and
generating live-streaming speech corresponding to the live-streaming script text according to the speech spectrum information.
12. A data conversion method, comprising:
acquiring script text to be performed, wherein the script text to be performed comprises one of the following: a line script corresponding to audio or video, or e-book text content;
acquiring a phoneme vector and a text vector corresponding to the script text and a voiceprint feature vector of a target voice;
obtaining a linguistic feature vector corresponding to the script text according to the phoneme vector and the text vector, and predicting a hidden prosody vector of the script text according to the text vector and the voiceprint feature vector;
generating speech spectrum information corresponding to the script text according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector; and
generating performance speech corresponding to the script text according to the speech spectrum information.
13. A data conversion method, comprising:
acquiring a phoneme vector corresponding to text to be converted through a phoneme encoding network of a prosody model, and acquiring a text vector corresponding to the text to be converted through a text encoding network of the prosody model;
predicting a hidden prosody vector of the text to be converted through a hidden prosody vector prediction network of the prosody model according to the text vector corresponding to the text to be converted and an acquired voiceprint feature vector of a target voice;
summing the phoneme vector and the text vector through a vector concatenation layer of the prosody model to obtain a linguistic feature vector corresponding to the text to be converted, and concatenating the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector to generate a concatenated vector; and
decoding the concatenated vector through a decoding network of the prosody model to obtain speech spectrum information corresponding to the text to be converted.
14. A data conversion apparatus, comprising:
an acquisition module, configured to acquire a phoneme vector and a text vector corresponding to text to be converted and a voiceprint feature vector of a target voice;
the acquisition module being further configured to obtain a linguistic feature vector corresponding to the text to be converted according to the phoneme vector and the text vector;
a prediction module, configured to predict a hidden prosody vector of the text to be converted according to the text vector and the voiceprint feature vector; and
a generation module, configured to generate speech spectrum information corresponding to the text to be converted according to the linguistic feature vector, the hidden prosody vector, and the voiceprint feature vector.
15. An electronic device, comprising:
a memory, configured to store a program; and
a processor, configured to execute the program stored in the memory, wherein when the program is executed, the processor is configured to perform the data conversion method according to any one of claims 1 to 13.
16. A computer storage medium having a computer program stored thereon, wherein when the program is executed by a processor, the data conversion method according to any one of claims 1 to 13 is implemented.
17. A computer program product, comprising a computer program, wherein when the computer program is executed by a processor, the data conversion method according to any one of claims 1 to 13 is implemented.
PCT/CN2022/130735 2021-12-20 2022-11-08 Data conversion method and computer storage medium WO2023116243A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111559250.5 2021-12-20
CN202111559250.5A CN113948062B (en) 2021-12-20 2021-12-20 Data conversion method and computer storage medium

Publications (1)

Publication Number Publication Date
WO2023116243A1 true WO2023116243A1 (en) 2023-06-29

Family

ID=79339324

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/130735 WO2023116243A1 (en) 2021-12-20 2022-11-08 Data conversion method and computer storage medium

Country Status (2)

Country Link
CN (1) CN113948062B (en)
WO (1) WO2023116243A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113948062B (en) * 2021-12-20 2022-08-16 阿里巴巴达摩院(杭州)科技有限公司 Data conversion method and computer storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080195391A1 (en) * 2005-03-28 2008-08-14 Lessac Technologies, Inc. Hybrid Speech Synthesizer, Method and Use
CN103117057A (en) * 2012-12-27 2013-05-22 安徽科大讯飞信息科技股份有限公司 Application method of special human voice synthesis technique in mobile phone cartoon dubbing
CN111161705A (en) * 2019-12-19 2020-05-15 上海寒武纪信息科技有限公司 Voice conversion method and device
CN111754976A (en) * 2020-07-21 2020-10-09 中国科学院声学研究所 Rhythm control voice synthesis method, system and electronic device
CN112086086A (en) * 2020-10-22 2020-12-15 平安科技(深圳)有限公司 Speech synthesis method, device, equipment and computer readable storage medium
CN112669841A (en) * 2020-12-18 2021-04-16 平安科技(深圳)有限公司 Training method and device for multilingual speech generation model and computer equipment
CN112750419A (en) * 2020-12-31 2021-05-04 科大讯飞股份有限公司 Voice synthesis method and device, electronic equipment and storage medium
CN113763920A (en) * 2020-05-29 2021-12-07 广东美的制冷设备有限公司 Air conditioner, voice generation method thereof, voice generation device and readable storage medium
CN113808571A (en) * 2021-08-17 2021-12-17 北京百度网讯科技有限公司 Speech synthesis method, speech synthesis device, electronic device and storage medium
CN113948062A (en) * 2021-12-20 2022-01-18 阿里巴巴达摩院(杭州)科技有限公司 Data conversion method and computer storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108597492B (en) * 2018-05-02 2019-11-26 百度在线网络技术(北京)有限公司 Phoneme synthesizing method and device
CN109285535A (en) * 2018-10-11 2019-01-29 四川长虹电器股份有限公司 Phoneme synthesizing method based on Front-end Design
CN110534089B (en) * 2019-07-10 2022-04-22 西安交通大学 Chinese speech synthesis method based on phoneme and prosodic structure
CN112786004A (en) * 2020-12-30 2021-05-11 科大讯飞股份有限公司 Speech synthesis method, electronic device, and storage device
CN113257221B (en) * 2021-07-06 2021-09-17 成都启英泰伦科技有限公司 Voice model training method based on front-end design and voice synthesis method
CN113506562B (en) * 2021-07-19 2022-07-19 武汉理工大学 End-to-end voice synthesis method and system based on fusion of acoustic features and text emotional features

Also Published As

Publication number Publication date
CN113948062B (en) 2022-08-16
CN113948062A (en) 2022-01-18

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 22909560
Country of ref document: EP
Kind code of ref document: A1