CN111899719B - Method, apparatus, device and medium for generating audio

Info

Publication number
CN111899719B
CN111899719B (application CN202010751729.8A)
Authority
CN
China
Prior art keywords
information
target
audio
text
trained
Prior art date
Legal status
Active
Application number
CN202010751729.8A
Other languages
Chinese (zh)
Other versions
CN111899719A (en)
Inventor
Yin Xiang (殷翔)
Current Assignee
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd
Priority to CN202010751729.8A
Publication of CN111899719A
Application granted
Publication of CN111899719B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/033: Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/12: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being prediction coefficients
    • G10L 25/18: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L 25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the present disclosure disclose methods, apparatuses, devices, and media for generating audio. One embodiment of the method for generating audio includes: acquiring target Mandarin text information and timbre information of user voice audio uttered by a target user; converting the target Mandarin text information into target dialect text information corresponding to it; and generating target voice audio based on the target dialect text information and the timbre information of the user voice audio, wherein the timbre of the target voice audio matches the timbre information of the user voice audio, and the target dialect text information indicates the text information corresponding to the target voice audio. This embodiment can convert Mandarin text into the corresponding dialect voice audio, and that dialect voice audio carries the timbre of the voice audio uttered by the target user, thereby enriching the ways in which voice audio can be generated.

Description

Method, apparatus, device and medium for generating audio
Technical Field
Embodiments of the present disclosure relate to the field of computer technology, and in particular, to a method, apparatus, device, and medium for generating audio.
Background
Text-To-Speech (TTS), also known as speech synthesis, is a technique that converts text information into intelligible, fluent spoken Chinese output. Speech synthesis not only helps visually impaired users read information on a computer, but also increases the readability of text documents.
Existing general-purpose speech synthesis technology typically records a speech corpus of a single timbre in advance and then builds a speech synthesis system on that corpus. Such a system can only convert user-entered text into speech of that single timbre, and the result is monotonous in timbre, intonation, and the like.
Disclosure of Invention
The present disclosure proposes methods, apparatus, devices and media for generating audio.
In a first aspect, embodiments of the present disclosure provide a method for generating audio, the method comprising: acquiring target Mandarin text information and timbre information of user voice audio uttered by a target user; converting the target Mandarin text information into target dialect text information corresponding to it; and generating target voice audio based on the target dialect text information and the timbre information of the user voice audio, wherein the timbre of the target voice audio matches the timbre information of the user voice audio, and the target dialect text information indicates the text information corresponding to the target voice audio.
In some embodiments, generating the target voice audio based on the target dialect text information and the timbre information of the user voice audio includes: extracting text feature information of the target dialect text information; inputting the text feature information into a pre-trained encoder to obtain encoded text feature information; inputting the encoded text feature information and the timbre information of the user voice audio into a pre-trained decoder to obtain mel spectrum information; and inputting the mel spectrum information into a vocoder to obtain the target voice audio.
In some embodiments, the pre-trained encoder and the pre-trained decoder are trained by: acquiring audio samples, provided by different users, that are labeled with mel spectrum information; inputting the audio samples into an encoder to be trained to obtain encoded audio samples; inputting the encoded audio samples into a text feature information classifier and a timbre information classifier, respectively, to obtain classified text feature information and classified timbre information; inputting the classified text feature information and the classified timbre information into a decoder to be trained to obtain predicted mel spectrum information; and adjusting parameters of the encoder and the decoder according to the deviation between the labeled mel spectrum information and the predicted mel spectrum information until the deviation satisfies a preset condition, yielding the trained encoder and decoder.
In some embodiments, the timbre information of the user voice audio is derived based on audio data provided by the target user and a pre-trained timbre encoder.
In some embodiments, generating the target voice audio based on the target dialect text information and the timbre information of the user voice audio includes: generating the target voice audio based on the target dialect text information, the timbre information of the user voice audio, and target speech style information, wherein the target speech style information indicates the style of the target voice audio.
In some embodiments, generating the target voice audio based on the target dialect text information, the timbre information of the user voice audio, and the target speech style information includes: extracting text feature information of the target dialect text information; inputting the text feature information into a pre-trained encoder to obtain encoded text feature information; inputting the encoded text feature information, the timbre information of the user voice audio, and the target speech style information into a pre-trained decoder to obtain mel spectrum information; and inputting the mel spectrum information into a vocoder to obtain the target voice audio.
In some embodiments, the target speech style information is obtained by: acquiring voice audio of a person having the speech style indicated by the target speech style information; and inputting that person's voice audio into a pre-trained speech style encoder to generate the target speech style information.
In a second aspect, embodiments of the present disclosure provide an apparatus for generating audio, the apparatus comprising: an acquisition unit configured to acquire target Mandarin text information and timbre information of user voice audio uttered by a target user; a conversion unit configured to convert the target Mandarin text information into target dialect text information corresponding to it; and a generation unit configured to generate target voice audio based on the target dialect text information and the timbre information of the user voice audio, wherein the timbre of the target voice audio matches the timbre information of the user voice audio, and the target dialect text information indicates the text information corresponding to the target voice audio.
In a third aspect, embodiments of the present disclosure provide an electronic device for generating audio, comprising: one or more processors; and a storage device having one or more programs stored thereon, which when executed by the one or more processors, cause the one or more processors to implement a method as in any of the embodiments of the method for generating audio described above.
In a fourth aspect, embodiments of the present disclosure provide a computer-readable medium for generating audio, having stored thereon a computer program which, when executed by a processor, implements a method as in any of the embodiments of the method for generating audio described above.
Embodiments of the present disclosure provide a method, apparatus, device, and medium for generating audio that acquire target Mandarin text information and timbre information of user voice audio uttered by a target user; convert the target Mandarin text information into target dialect text information corresponding to it; and generate target voice audio based on the target dialect text information and the timbre information of the user voice audio, wherein the timbre of the target voice audio matches the timbre information of the user voice audio, and the target dialect text information indicates the text information corresponding to the target voice audio. Mandarin text can thus be converted into the corresponding dialect voice audio, and that dialect voice audio carries the timbre of the voice audio uttered by the target user, which enriches the ways in which voice audio can be generated.
Drawings
Other features, objects and advantages of the present disclosure will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the following drawings:
FIG. 1 is an exemplary system architecture diagram in which an embodiment of the present disclosure may be applied;
FIG. 2 is a flow chart of one embodiment of a method for generating audio according to the present disclosure;
FIG. 3 is a schematic illustration of one application scenario of a method for generating audio according to the present disclosure;
FIG. 4 is a flow chart of yet another embodiment of a method for generating audio according to the present disclosure;
FIG. 5 is a schematic structural diagram of one embodiment of an apparatus for generating audio according to the present disclosure;
Fig. 6 is a schematic diagram of a computer system suitable for use in implementing embodiments of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the present disclosure and not limiting thereof. It should be further noted that, for convenience of description, only a portion relevant to the present disclosure is shown in the drawings.
It should be noted that, without conflict, the embodiments of the present disclosure and features of the embodiments may be combined with each other. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of methods for generating audio or apparatuses for generating audio of embodiments of the present disclosure may be applied.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or transmit data (e.g., target Mandarin text information and timbre information of user voice audio uttered by the target user), etc. Various client applications, such as audio playing software, music processing applications, news information class applications, image processing class applications, web browser applications, shopping class applications, search class applications, instant messaging tools, mailbox clients, social platform software, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices having information processing functions, including but not limited to smart phones, tablet computers, laptop computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they may be installed in the electronic devices listed above, and may be implemented as a plurality of pieces of software or software modules (e.g., software or software modules for providing a generated-audio service) or as a single piece of software or software module. No specific limitation is imposed here.
The server 105 may be a server providing various services, such as a background audio processing server that generates target voice audio based on the Mandarin text information provided by the target user and the user voice audio uttered by the target user, both transmitted by the terminal devices 101, 102, 103. Optionally, the background audio processing server may further feed the generated target voice audio back to the terminal device for playback. As an example, the server 105 may be a cloud server.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by a plurality of servers or as a single server. When the server is software, it may be implemented as a plurality of pieces of software or software modules (e.g., software or software modules for providing a generated-audio service) or as a single piece of software or software module. No specific limitation is imposed here.
It should also be noted that the method for generating audio provided by the embodiments of the present disclosure may be performed by a server, by a terminal device, or by the server and the terminal device in cooperation with each other. Accordingly, the parts (for example, units, sub-units, modules, and sub-modules) included in the apparatus for generating audio may all be disposed in the server, may all be disposed in the terminal device, or may be distributed between the server and the terminal device.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. When the electronic device on which the method for generating audio is run does not need to perform data transmission with other electronic devices, the system architecture may include only the electronic device (e.g., server or terminal device) on which the method for generating audio is run.
With continued reference to fig. 2, a flow 200 of one embodiment of a method for generating audio according to the present disclosure is shown. The method for generating audio comprises the following steps:
Step 201, acquiring target Mandarin text information and timbre information of user voice audio uttered by a target user.
In this embodiment, the execution body of the method for generating audio (such as the server or terminal device shown in fig. 1) may acquire the target Mandarin text information and the timbre information of the user voice audio uttered by the target user, either locally or from other electronic devices through a wired or wireless connection.
The target user may be any user, and the user voice audio may be audio of any speech uttered by the target user. For example, the user voice audio may be audio of a song sung by the target user, or audio of speech uttered by the target user during a conversation.
Here, the timbre information of the user voice audio may be obtained by a pre-trained timbre information generation model, which can be trained on voice audio samples labeled with timbre information.
Specifically, the execution body may input the user voice audio into the pre-trained timbre information generation model to generate the timbre information of the user voice audio.
Alternatively, the timbre information generation model may be a model trained by an unsupervised machine learning algorithm.
In some alternatives, the timbre information of the user voice audio is derived based on audio data provided by the target user and a pre-trained timbre encoder.
In this implementation, the timbre encoder is used to capture the timbre characteristics of the voice audio data provided by the target user; these characteristics are independent of the text content of the voice audio and of any particular speaker style, and the output of the pre-trained timbre encoder can take the form of an embedding vector.
In this implementation, obtaining the timbre information of the user voice audio from the audio data provided by the target user and a pre-trained timbre encoder captures the timbre characteristics of the voice audio more effectively, which further improves the accuracy of the obtained timbre information of the user voice audio.
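As an illustrative, non-limiting sketch of such a timbre encoder (the LSTM architecture, layer sizes, and 128-dimensional embedding below are assumptions made for this example, not details from the disclosure), one possible implementation is:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimbreEncoder(nn.Module):
    """Maps a mel spectrogram of user speech to a fixed-size timbre embedding.

    Hypothetical architecture: an LSTM over mel frames; the final hidden
    state is projected and L2-normalized, so the embedding has a fixed size
    regardless of utterance length or content.
    """

    def __init__(self, n_mels: int = 80, hidden: int = 256, emb_dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, mels: torch.Tensor) -> torch.Tensor:
        # mels: (batch, frames, n_mels)
        _, (h, _) = self.lstm(mels)
        emb = self.proj(h[-1])           # last layer's final hidden state
        return F.normalize(emb, dim=-1)  # unit-norm timbre embedding

# Usage: one utterance of 500 mel frames yields one 128-dim timbre vector,
# playing the role of the embedding vector described above.
encoder = TimbreEncoder()
timbre = encoder(torch.randn(1, 500, 80))
print(timbre.shape)  # torch.Size([1, 128])
```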
Step 202, converting the target Mandarin text information into target dialect text information corresponding to it.
In this embodiment, the execution body converts the acquired target Mandarin text information into target dialect text information corresponding to it.
The target dialect text information may be text information of any dialect, for example, the Shanxi dialect, the Hunan dialect, the Sichuan dialect, and so on; the present application is not limited in this respect.
Here, the execution body may convert the target Mandarin text information into the corresponding target dialect text information in either of two ways: it may use a pre-trained dialect text conversion model, obtained by training on Mandarin text samples labeled with the corresponding dialect text information; or it may use a preset comparison table between dialect text information and Mandarin text information and look up, in that table, the dialect text information corresponding to the Mandarin text information. The present application is not limited in this respect.
As an example, if the Mandarin text information is "open that drawer", the corresponding dialect text information is the wording of "open that drawer" in the target dialect.
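A minimal sketch of the comparison-table approach follows; the table entries and the longest-match-first policy are invented for illustration, and a deployed system would fall back to the trained dialect text conversion model for uncovered spans:

```python
# Hypothetical Mandarin-to-dialect phrase table; entries are invented
# examples for illustration only.
MANDARIN_TO_DIALECT = {
    "把那个抽屉打开": "把那个抽抽儿拉开",
    "什么": "啥子",
}

def mandarin_to_dialect(text: str, table: dict[str, str]) -> str:
    """Convert Mandarin text to dialect text via table lookup.

    Phrases found in the table are replaced by their dialect forms,
    trying longer phrases first; anything uncovered is left unchanged
    (a trained dialect text conversion model would handle those spans
    in a full system).
    """
    for phrase in sorted(table, key=len, reverse=True):  # longest match first
        text = text.replace(phrase, table[phrase])
    return text

print(mandarin_to_dialect("把那个抽屉打开", MANDARIN_TO_DIALECT))
```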
Step 203, generating target voice audio based on the target dialect text information and the timbre information of the user voice audio.
In this embodiment, the execution body may generate the target voice audio from the target dialect text information and the timbre information of the user voice audio. The timbre of the target voice audio matches the timbre information of the user voice audio, and the target dialect text information indicates the text information corresponding to the target voice audio.
As an example, the execution body may input the target dialect text information and the timbre information of the user voice audio into a pre-trained audio generation model to generate the target voice audio. The audio generation model can be trained on sample data in which target dialect text information and timbre information of user voice audio are labeled with the corresponding voice audio.
In some alternatives, generating the target voice audio based on the target dialect text information and the timbre information of the user voice audio includes: generating the target voice audio based on the target dialect text information, the timbre information of the user voice audio, and target speech style information.
In this implementation, the execution body may generate the target voice audio based on the target dialect text information, the timbre information of the user voice audio, and the target speech style information acquired in the above steps. The speech style of the target voice audio is the style indicated by the target speech style information.
Here, speech style information (including the target speech style information) may characterize the style of speech; for example, it may characterize at least one of: speaking rate, rhythm, intonation, accent, and articulation.
The execution body may input voice audio into a pre-trained speech style information generation model to obtain the speech style information of that voice audio. The speech style information generation model can be trained on voice audio samples labeled with speech style information.
In this implementation, generating the target voice audio based on the target dialect text information, the timbre information of the user voice audio, and the target speech style information allows the generated dialect audio to further carry the style of a speaker of that dialect, which improves the naturalness and fluency of the generated dialect audio.
In some alternatives, the target speech style information is obtained by: acquiring voice audio of a person having the speech style indicated by the target speech style information; and inputting that person's voice audio into a pre-trained speech style encoder to generate the target speech style information.
In this implementation, the execution body may acquire the voice audio of a person having the speech style indicated by the target speech style information and input that voice audio into the pre-trained speech style encoder to obtain the target speech style information.
The speech style encoder is used to capture the style characteristics of the input voice audio; these characteristics are independent of the text content of the voice audio and of any particular speaker's timbre, and the output of the pre-trained style encoder can take the form of an embedding vector.
As an example, if the target voice audio is dialect voice audio, the execution body may use the style information of a dialect speaker's voice audio as the target speech style information: it obtains dialect voice audio provided by that dialect speaker and inputs it into the pre-trained speech style encoder to obtain the target speech style information.
By acquiring voice audio of a person having the speech style indicated by the target speech style information and inputting that voice audio into a pre-trained speech style encoder to generate the target speech style information, this implementation captures the style characteristics of voice audio more effectively and improves the accuracy of the acquired target speech style information.
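An illustrative sketch of this flow is given below, assuming the style encoder mean-pools convolutional features of a reference speaker's mel spectrogram into a style embedding; the architecture and dimensions are assumptions for this example only:

```python
import torch
import torch.nn as nn

class SpeechStyleEncoder(nn.Module):
    """Hypothetical speech style encoder: 1-D convolutions over mel frames,
    mean-pooled over time, so the embedding reflects global prosodic style
    (pace, rhythm, intonation) rather than frame-level content."""

    def __init__(self, n_mels: int = 80, emb_dim: int = 128):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, emb_dim, kernel_size=5, padding=2), nn.ReLU(),
        )

    def forward(self, mels: torch.Tensor) -> torch.Tensor:
        # mels: (batch, frames, n_mels); convolve over time, then mean-pool
        feats = self.convs(mels.transpose(1, 2))  # (batch, emb_dim, frames)
        return feats.mean(dim=-1)                 # (batch, emb_dim)

# Usage: reference audio of a dialect speaker yields the target style embedding.
style_encoder = SpeechStyleEncoder()
style = style_encoder(torch.randn(1, 400, 80))
print(style.shape)  # torch.Size([1, 128])
```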
With continued reference to fig. 3, fig. 3 is a schematic diagram of one application scenario of the method for generating audio according to the present embodiment. In the application scenario of fig. 3, the server 301 first acquires the target Mandarin text information 302 (e.g., "open that drawer") and the timbre information 303 of the user voice audio uttered by the target user, then converts the acquired target Mandarin text information into target dialect text information 304 (the dialect wording of "open that drawer"), and finally generates target voice audio 305 (e.g., dialect audio with the timbre of the target user's voice audio) based on the target dialect text information and the timbre information of the user voice audio, wherein the timbre of the target voice audio 305 matches the timbre information 303 of the user voice audio, and the target dialect text information 304 indicates the text information corresponding to the target voice audio.
The method provided by the above embodiment of the present disclosure acquires target Mandarin text information and timbre information of user voice audio uttered by a target user; converts the target Mandarin text information into target dialect text information corresponding to it; and generates target voice audio based on the target dialect text information and the timbre information of the user voice audio, wherein the timbre of the target voice audio matches the timbre information of the user voice audio and the target dialect text information indicates the text information corresponding to the target voice audio. Mandarin text is thus converted into the corresponding dialect voice audio, which carries the timbre of the voice audio uttered by the target user, enriching the ways in which voice audio can be generated.
With further reference to fig. 4, a flow 400 of yet another embodiment of a method for generating audio is shown. The flow 400 of the method for generating audio comprises the steps of:
Step 401, acquiring target Mandarin text information and timbre information of user voice audio uttered by a target user.
In this embodiment, step 401 is substantially identical to step 201 in the corresponding embodiment of fig. 2, and will not be described herein.
Step 402, converting the target Mandarin text information into target dialect text information corresponding to the target Mandarin text information.
In this embodiment, step 402 is substantially identical to step 202 in the corresponding embodiment of fig. 2, and will not be described herein.
Step 403, extracting text feature information of the target dialect text information.
In this embodiment, the execution body may extract the text feature information of the target dialect text information by performing text analysis on it.
The text feature information may include: phonemes, tones, word segmentation, prosody, phrases, and the like.
Specifically, the text analysis process may include: converting the text information into its corresponding phonemes through G2P (Grapheme-to-Phoneme) conversion, determining the corresponding word segmentation through a word segmentation prediction model, and determining the corresponding prosodic phrases through a prosodic phrase prediction model, where the prosodic phrase prediction model can be trained on text information samples labeled with prosodic phrases and the word segmentation prediction model can be trained on text information samples labeled with word segmentation.
Here, the prosodic phrase prediction model and the word segmentation prediction model may be implemented based on statistical machine learning algorithms, for example, the HMM (Hidden Markov Model) algorithm or the SVM (Support Vector Machine) algorithm; the present application is not limited in this respect.
As an example, for target dialect text information meaning "a strong wind these past two days left the ground in front of my house covered in dust", the corresponding text feature information may be "qian2liang3tian1|da4feng1|gua1de|wo3jia1|qian2mian4|bao4tu3yang2chang2|de".
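A toy sketch of this text-analysis front end follows. The grapheme-to-pinyin table and the hard-coded phrase boundaries are stand-ins for the G2P step and the trained word segmentation and prosodic phrase prediction models described above:

```python
# Toy text-analysis front end; the table covers only the example characters.
G2P = {"前": "qian2", "两": "liang3", "天": "tian1",
       "大": "da4", "风": "feng1", "刮": "gua1", "得": "de"}

def extract_text_features(text: str, boundaries: set[int]) -> str:
    """Convert text to tone-numbered pinyin and insert '|' after each
    character index listed in `boundaries` (the prosodic phrase breaks
    a prediction model would supply in a full system)."""
    out = []
    for i, ch in enumerate(text):
        out.append(G2P.get(ch, ch))  # fall back to the raw character
        if i in boundaries:
            out.append("|")
    return "".join(out).rstrip("|")

# "前两天大风刮得" with phrase breaks after "天" and "风":
print(extract_text_features("前两天大风刮得", boundaries={2, 4}))
# -> qian2liang3tian1|da4feng1|gua1de
```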
Step 404, inputting the text feature information into a pre-trained encoder to obtain encoded text feature information.
In this embodiment, the pre-trained encoder is used to encode the text feature information. The encoder may be implemented based on existing or future neural network architectures, for example, CNN (Convolutional Neural Network), LSTM (Long Short-Term Memory) network, GRU (Gated Recurrent Unit), BGRU (Bidirectional Gated Recurrent Unit), and the like; the present application is not limited in this respect.
Preferably, the encoder may be implemented based on a hybrid CNN+BGRU neural network, which can learn deeper text features through its convolutional layers.
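A minimal sketch of such a CNN+BGRU text encoder is shown below; the vocabulary size, embedding width, and layer counts are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Sketch of a CNN+BGRU text encoder: convolutional layers learn local
    (deeper) text features, and a bidirectional GRU models context in both
    directions along the feature sequence."""

    def __init__(self, vocab: int = 512, emb: int = 256, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.convs = nn.Sequential(
            nn.Conv1d(emb, emb, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(emb, emb, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.bgru = nn.GRU(emb, hidden, batch_first=True, bidirectional=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq) integer ids of phoneme/tone/prosody symbols
        x = self.embed(tokens)                            # (batch, seq, emb)
        x = self.convs(x.transpose(1, 2)).transpose(1, 2)
        out, _ = self.bgru(x)                             # (batch, seq, 2*hidden)
        return out

encoder = TextEncoder()
encoded = encoder(torch.randint(0, 512, (1, 42)))
print(encoded.shape)  # torch.Size([1, 42, 512])
```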
Step 405, inputting the encoded text feature information and the timbre information of the user voice audio into a pre-trained decoder to obtain mel spectrum information.
In this embodiment, the execution body may input the encoded text feature information and the timbre information of the user voice audio into the pre-trained decoder to obtain the mel spectrum information.
Here, the pre-trained decoder may be an autoregressive decoder or a non-autoregressive decoder. Compared with other decoders, an autoregressive decoder can better exploit the dependency structure of voice audio across different time scales, improving the generation quality of the target voice audio.
Furthermore, it should be noted that the encoder and the decoder can be connected directly through an attention mechanism.
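A simplified sketch of an autoregressive decoder with attention over the encoder outputs follows. The additive attention, cell sizes, and conditioning-by-concatenation are assumptions for illustration, not the disclosure's exact architecture:

```python
import torch
import torch.nn as nn

class ARDecoder(nn.Module):
    """Sketch of an autoregressive mel decoder. At each step, the previous
    mel frame, an attention context over the encoder outputs, and the
    timbre embedding (repeated at every step) feed an LSTM cell that
    predicts the next mel frame."""

    def __init__(self, enc_dim: int = 512, timbre_dim: int = 128, n_mels: int = 80):
        super().__init__()
        self.n_mels = n_mels
        self.attn = nn.Linear(enc_dim + n_mels, 1)  # toy additive attention
        self.cell = nn.LSTMCell(enc_dim + timbre_dim + n_mels, 512)
        self.out = nn.Linear(512, n_mels)

    def forward(self, enc: torch.Tensor, timbre: torch.Tensor, steps: int) -> torch.Tensor:
        b, t, _ = enc.shape
        frame = enc.new_zeros(b, self.n_mels)       # <GO> frame
        h = c = enc.new_zeros(b, 512)
        mels = []
        for _ in range(steps):
            # attention weights over encoder steps, conditioned on last frame
            q = frame.unsqueeze(1).expand(b, t, self.n_mels)
            w = torch.softmax(self.attn(torch.cat([enc, q], -1)).squeeze(-1), -1)
            ctx = (w.unsqueeze(-1) * enc).sum(1)    # (b, enc_dim) context
            h, c = self.cell(torch.cat([ctx, timbre, frame], -1), (h, c))
            frame = self.out(h)
            mels.append(frame)
        return torch.stack(mels, 1)                 # (b, steps, n_mels)

dec = ARDecoder()
mel = dec(torch.randn(1, 42, 512), torch.randn(1, 128), steps=200)
print(mel.shape)  # torch.Size([1, 200, 80])
```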
Step 406, inputting the mel spectrum information into a vocoder to obtain the target voice audio.
In this embodiment, the execution body inputs the mel spectrum information obtained in the above steps into the vocoder to obtain the target voice audio, where the vocoder characterizes the correspondence between mel spectrum information and voice audio.
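The mel-to-waveform contract of the vocoder stage can be illustrated with a stand-in inversion. The sketch below uses librosa's Griffin-Lim based mel inversion in place of the trained vocoder contemplated here, with arbitrary illustrative frame parameters:

```python
import numpy as np
import librosa

def mel_to_audio(mel_power: np.ndarray, sr: int = 22050) -> np.ndarray:
    """Stand-in 'vocoder': invert a power mel spectrogram to a waveform via
    librosa's Griffin-Lim based mel inversion. A production system would use
    a trained neural vocoder; this only illustrates the stage's interface."""
    return librosa.feature.inverse.mel_to_audio(
        mel_power, sr=sr, n_fft=1024, hop_length=256)

# Round trip on a synthetic tone to sanity-check shapes.
sr = 22050
t = np.linspace(0, 1.0, sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 220.0 * t).astype(np.float32)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256)
wave = mel_to_audio(mel, sr=sr)
print(mel.shape, wave.shape)  # (128, frames) and roughly (sr,) samples
```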
In some alternatives, the pre-trained encoder and the pre-trained decoder are trained by: acquiring audio samples, provided by different users, that are labeled with mel spectrum information; inputting the audio samples into an encoder to be trained to obtain encoded audio samples; inputting the encoded audio samples into a text feature information classifier and a timbre information classifier, respectively, to obtain classified text feature information and classified timbre information; inputting the classified text feature information and the classified timbre information into a decoder to be trained to obtain predicted mel spectrum information; and adjusting parameters of the encoder and the decoder according to the deviation between the labeled mel spectrum information and the predicted mel spectrum information until the deviation satisfies a preset condition, yielding the trained encoder and decoder.
In this implementation, the pre-trained encoder and the pre-trained decoder are trained as follows. First, audio samples provided by different users and labeled with mel spectrum information are acquired. Here, the audio samples provided by different users may be audio samples of different dialects.
As an example, the audio samples of different dialects may include Mandarin audio provided by the target user and dialect audio provided by five dialect speakers, one dialect each, such as Shanxi, Hunan, Sichuan, Northeastern, and Cantonese, with two hours of audio per speaker.
The audio samples are then input into the encoder to be trained to obtain encoded audio samples, and the encoded audio samples are input into a text feature information classifier and a timbre information classifier, respectively, to obtain classified text feature information and classified timbre information.
Here, the text feature information classifier is used to ensure, by means of supervised learning, that the encoder encodes only text feature information, while the timbre information classifier is used to ensure, likewise by supervised learning, that the encoder encodes only timbre feature information.
Further, the execution body may combine the classified text feature information and the classified timbre information and input them into the decoder to be trained to obtain predicted mel spectrum information, compute the deviation between the predicted mel spectrum information and the labeled mel spectrum information, and adjust the parameters of the encoder and the decoder according to that deviation until it satisfies a preset condition, yielding the trained encoder and decoder.
In addition, it should be noted that the encoded audio samples may also be input into a style information classifier, which is used to ensure, by means of supervised learning, that the encoder encodes only style feature information.
Further, the execution body may combine the classified text feature information, the classified timbre information, and the classified style information and input them into the decoder to be trained to obtain predicted mel spectrum information, compute the deviation between the predicted and labeled mel spectrum information, and adjust the parameters of the encoder and the decoder according to that deviation until it satisfies the preset condition, yielding the trained encoder and decoder.
In this implementation, training the encoder and decoder on audio samples provided by different users and labeled with mel spectrum information lets the trained encoder and decoder learn the text feature information and timbre information characteristics of different dialect audio, which improves their generalization ability.
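One loose, illustrative reading of this training procedure is sketched below. It is not the disclosure's exact wiring: the two classifiers are applied as supervised losses on the encoded samples, an utterance-level mean stands in for the classified timbre stream fed to the decoder, and all modules and dimensions are simplified stand-ins:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative stand-ins; real architectures would be the CNN+BGRU encoder
# and autoregressive decoder sketched earlier.
n_mels, d, n_phones, n_speakers = 80, 256, 100, 6
encoder = nn.GRU(n_mels, d, batch_first=True)      # encoder to be trained
text_clf = nn.Linear(d, n_phones)                  # text feature information classifier
timbre_clf = nn.Linear(d, n_speakers)              # timbre information classifier
decoder = nn.GRU(2 * d, n_mels, batch_first=True)  # decoder to be trained

params = [p for m in (encoder, text_clf, timbre_clf, decoder) for p in m.parameters()]
opt = torch.optim.Adam(params, lr=1e-3)

def train_step(audio, mel_target, phone_labels, speaker_labels):
    """One step on audio samples labeled with mel spectrum information."""
    enc, _ = encoder(audio)                        # encoded audio: (b, t, d)
    utt = enc.mean(dim=1)                          # utterance-level summary
    # Supervised classifiers keep text and timbre information separable.
    text_loss = F.cross_entropy(text_clf(enc).transpose(1, 2), phone_labels)
    timbre_loss = F.cross_entropy(timbre_clf(utt), speaker_labels)
    # Combine the two information streams and decode predicted mels.
    dec_in = torch.cat([enc, utt.unsqueeze(1).expand_as(enc)], dim=-1)
    mel_pred, _ = decoder(dec_in)
    mel_loss = F.l1_loss(mel_pred, mel_target)     # deviation from the labels
    loss = mel_loss + text_loss + timbre_loss
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

b, t = 4, 120
loss = train_step(torch.randn(b, t, n_mels), torch.randn(b, t, n_mels),
                  torch.randint(0, n_phones, (b, t)),
                  torch.randint(0, n_speakers, (b,)))
print(loss)
```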
In some alternatives, generating the target voice audio based on the target dialect text information, the timbre information of the user voice audio, and the target speech style information includes: extracting text feature information of the target dialect text information; inputting the text feature information into a pre-trained encoder to obtain encoded text feature information; inputting the encoded text feature information, the timbre information of the user voice audio, and the target speech style information into a pre-trained decoder to obtain mel spectrum information; and inputting the mel spectrum information into a vocoder to obtain the target voice audio.
In this implementation, the execution body may extract the text feature information of the target dialect text information by performing text analysis on it.
Next, the text feature information is input into the pre-trained encoder to obtain encoded text feature information, and the encoded text feature information, the timbre information of the user voice audio, and the target speech style information are input into the pre-trained decoder to obtain mel spectrum information.
Here, the timbre information of the user voice audio may be obtained based on a pre-trained timbre encoder, and the target speech style information may be obtained based on a pre-trained style encoder.
Finally, the execution body inputs the mel spectrum information obtained in the above steps into the vocoder to obtain the target voice audio, where the vocoder characterizes the correspondence between mel spectrum information and voice audio.
In this implementation, the pre-trained decoder combines the encoded text feature information, the timbre information of the user voice audio, and the target style information into mel spectrum information, and that mel spectrum information is input into the vocoder to obtain the target voice audio. The target voice audio therefore fully combines the target dialect text information, the timbre information of the user voice audio, and the target style information, and generating it with a vocoder improves the accuracy of the generated target voice audio, bringing it closer to real voice audio and making the synthesis sound more natural.
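The way the decoder's three inputs are combined in this variant can be illustrated at the shape level; the dimensions and the concatenate-and-broadcast scheme below are assumptions, since the disclosure does not fix a combination mechanism beyond naming the inputs:

```python
import torch

enc = torch.randn(1, 42, 512)   # encoded text feature information
timbre = torch.randn(1, 128)    # timbre information of the user voice audio
style = torch.randn(1, 128)     # target speech style information

# Concatenate the utterance-level conditions and repeat them at every
# encoder step before feeding the decoder.
cond = torch.cat([timbre, style], dim=-1)                  # (1, 256)
cond_seq = cond.unsqueeze(1).expand(-1, enc.size(1), -1)   # (1, 42, 256)
decoder_input = torch.cat([enc, cond_seq], dim=-1)         # (1, 42, 768)
print(decoder_input.shape)
```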
As can be seen from fig. 4, compared with the embodiment corresponding to fig. 2, the flow 400 of the method for generating audio in this embodiment highlights the steps of encoding the text feature information with an encoder, decoding the encoded text feature information and the timbre information of the user voice audio with a decoder, and obtaining the target voice audio with a vocoder. The scheme described in this embodiment therefore lets the target voice audio fully combine the target dialect text information and the timbre information of the voice audio, and generating the target voice audio with a vocoder improves its accuracy, bringing the generated target voice audio closer to real voice audio.
With further reference to fig. 5, as an implementation of the method shown in the above figures, the present disclosure provides an embodiment of an apparatus for generating audio, which corresponds to the method embodiment shown in fig. 2, and which may include the same or corresponding features as the method embodiment shown in fig. 2, in addition to the features described below, and produces the same or corresponding effects as the method embodiment shown in fig. 2. The device can be applied to various electronic equipment.
As shown in fig. 5, the apparatus 500 for generating audio of the present embodiment includes: an acquisition unit 501 configured to acquire target Mandarin text information and timbre information of user voice audio uttered by a target user; a conversion unit 502 configured to convert the target Mandarin text information into target dialect text information corresponding to it; and a generation unit 503 configured to generate target voice audio based on the target dialect text information and the timbre information of the user voice audio, wherein the timbre of the target voice audio matches the timbre information of the user voice audio, and the target dialect text information indicates the text information corresponding to the target voice audio.
In this embodiment, the acquisition unit 501 of the apparatus 500 for generating audio may acquire the target Mandarin text information and the timbre information of the user voice audio uttered by the target user from other electronic devices through a wired or wireless connection.
In this embodiment, the generation unit 503 may generate the target voice audio based on the timbre information of the user voice audio uttered by the target user, as acquired by the acquisition unit 501, and the target dialect text information obtained by the conversion unit 502. The timbre of the target voice audio matches the timbre information of the user voice audio, and the target dialect text information indicates the text information corresponding to the target voice audio.
In some optional implementations of the present embodiment, the generation unit 503 includes: a first generation subunit (not shown) configured to extract text feature information of the target dialect text information; a second generation subunit (not shown) configured to input the text feature information into a pre-trained encoder to obtain encoded text feature information; a third generation subunit (not shown) configured to input the encoded text feature information and the timbre information of the user voice audio into a pre-trained decoder to obtain mel spectrum information; and a fourth generation subunit (not shown) configured to input the mel spectrum information into a vocoder to obtain the target voice audio.
In some optional implementations of the present embodiment, the pre-trained encoder and the pre-trained decoder are trained by: acquiring audio samples, provided by different users, that are labeled with mel spectrum information; inputting the audio samples into an encoder to be trained to obtain encoded audio samples; inputting the encoded audio samples into a text feature information classifier and a timbre information classifier, respectively, to obtain classified text feature information and classified timbre information; inputting the classified text feature information and the classified timbre information into a decoder to be trained to obtain predicted mel spectrum information; and adjusting parameters of the encoder and the decoder according to the deviation between the labeled mel spectrum information and the predicted mel spectrum information until the deviation satisfies a preset condition, yielding the trained encoder and decoder.
In some optional implementations of the present embodiment, the timbre information of the user voice audio is derived based on audio data provided by the target user and a pre-trained timbre encoder.
In some optional implementations of this embodiment, the generation unit is further configured to generate the target voice audio based on the target dialect text information, the timbre information of the user voice audio, and target speech style information, where the target speech style information indicates the style of the target voice audio.
In some optional implementations of this embodiment, generating the target voice audio based on the target dialect text information, the timbre information of the user voice audio, and the target speech style information includes: extracting text feature information of the target dialect text information; inputting the text feature information into a pre-trained encoder to obtain encoded text feature information; inputting the encoded text feature information, the timbre information of the user voice audio, and the target speech style information into a pre-trained decoder to obtain mel spectrum information; and inputting the mel spectrum information into a vocoder to obtain the target voice audio.
In some optional implementations of this embodiment, the target speech style information is obtained by: acquiring voice audio of a person having the speech style indicated by the target speech style information; and inputting that person's voice audio into a pre-trained speech style encoder to generate the target speech style information.
In the apparatus provided by the above embodiment of the present disclosure, the acquisition unit 501 acquires the target Mandarin text information and the timbre information of the user voice audio uttered by the target user, the conversion unit 502 converts the target Mandarin text information into the target dialect text information corresponding to it, and the generation unit 503 then generates the target voice audio based on the target dialect text information and the timbre information of the user voice audio, wherein the timbre of the target voice audio matches the timbre information of the user voice audio, and the target dialect text information indicates the text information corresponding to the target voice audio.
Referring now to fig. 6, a schematic diagram of an electronic device (e.g., server or terminal device of fig. 1) 600 suitable for use in implementing embodiments of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), car terminals (e.g., car navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The terminal device/server illustrated in fig. 6 is merely an example, and should not impose any limitation on the functionality and scope of use of embodiments of the present disclosure.
As shown in fig. 6, the electronic device 600 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 601, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the electronic apparatus 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
In general, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, and the like; an output device 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, magnetic tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 shows an electronic device 600 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead. Each block shown in fig. 6 may represent one device or a plurality of devices as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via communication means 609, or from storage means 608, or from ROM 602. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing means 601.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including object-oriented programming languages such as Python, Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In accordance with one or more embodiments of the present disclosure, the present disclosure provides a method for generating audio, the method comprising: acquiring target Mandarin text information and tone information of user voice audio sent by a target user; converting the target Mandarin text information into target dialect text information corresponding to the target Mandarin text information; and generating target voice audio based on the target dialect text information and the tone information of the user voice audio, wherein the tone of the target voice audio is matched with the tone information of the user voice audio, and the target dialect text information is used for indicating the text information corresponding to the target voice audio.
In accordance with one or more embodiments of the present disclosure, in a method for generating audio provided by the present disclosure, target voice audio is generated based on target dialect text information and timbre information of user voice audio, including: extracting text characteristic information of the target dialect text information; inputting the text characteristic information into a pre-trained encoder to obtain encoded text characteristic information; inputting the coded text characteristic information and tone information of user voice audio into a pre-trained decoder to obtain Mel frequency spectrum information; and inputting the Mel frequency spectrum information into a vocoder to obtain the target voice audio.
In accordance with one or more embodiments of the present disclosure, the present disclosure provides a method for generating audio, wherein a pre-trained encoder and a pre-trained decoder are trained by: acquiring audio samples provided by different users marked with Mel frequency spectrum information; inputting the audio sample into an encoder to be trained to obtain an encoded audio sample; inputting the encoded audio samples into a text characteristic information classifier and a tone information classifier respectively to obtain classified text characteristic information and classified tone information; inputting the classified text characteristic information and the classified tone information into a decoder to be trained to obtain predicted Mel frequency spectrum information; and adjusting parameters of the encoder and the decoder according to the deviation of the marked mel spectrum information and the predicted mel spectrum information until the deviation meets the preset condition, and obtaining the encoder and the decoder after training.
In accordance with one or more embodiments of the present disclosure, in a method for generating audio provided by the present disclosure, timbre information of user speech audio is obtained based on audio data provided by a target user and a pre-trained timbre encoder.
In accordance with one or more embodiments of the present disclosure, in a method for generating audio provided by the present disclosure, target voice audio is generated based on target dialect text information and timbre information of user voice audio, including: and generating target voice audio based on the target dialect text information, tone information of the user voice audio and target voice style information, wherein the target voice style information is used for indicating the style of the target voice audio.
In accordance with one or more embodiments of the present disclosure, in the method for generating audio provided by the present disclosure, generating the target voice audio based on the target dialect text information, the timbre information of the user voice audio, and the target voice style information includes: extracting text feature information of the target dialect text information; inputting the text feature information into a pre-trained encoder to obtain encoded text feature information; inputting the encoded text feature information, the timbre information of the user voice audio, and the target voice style information into a pre-trained decoder to obtain Mel spectrum information; and inputting the Mel spectrum information into a vocoder to obtain the target voice audio.
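Relative to the earlier decoder sketch, the only change is that the decoder input now concatenates encoded text, timbre, and style. The sketch below again uses assumed dimensions.

import torch
import torch.nn as nn

ENC_DIM, TIMBRE_DIM, STYLE_DIM, N_MELS = 128, 32, 16, 80
decoder = nn.Linear(ENC_DIM + TIMBRE_DIM + STYLE_DIM, N_MELS)

def decode(encoded, timbre, style):
    # encoded: (batch, frames, ENC_DIM); timbre: (batch, TIMBRE_DIM); style: (batch, STYLE_DIM)
    cond = torch.cat([timbre, style], dim=-1)                 # per-utterance conditioning vector
    cond = cond.unsqueeze(1).expand(-1, encoded.size(1), -1)  # repeat per frame
    return decoder(torch.cat([encoded, cond], dim=-1))        # (batch, frames, N_MELS)

mel = decode(torch.randn(1, 50, ENC_DIM), torch.randn(1, TIMBRE_DIM), torch.randn(1, STYLE_DIM))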
In accordance with one or more embodiments of the present disclosure, in the method for generating audio provided by the present disclosure, the target voice style information is obtained by: acquiring voice audio of a person having the voice style indicated by the target voice style information; and inputting the person's voice audio into a pre-trained voice style encoder to generate the target voice style information.
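A voice style encoder can be sketched the same way as the timbre encoder: a reference utterance spoken in the desired style is pooled into a fixed-size style embedding. The GRU-plus-pooling structure and the sizes below are assumptions.

import torch
import torch.nn as nn

style_encoder = nn.GRU(80, 16, batch_first=True)  # 80 Mel bins -> 16-dim style space (assumed)

def encode_style(reference_mel):
    # reference_mel: Mel spectrum of an utterance spoken in the desired style.
    out, _ = style_encoder(reference_mel)
    return out.mean(dim=1)  # (batch, 16) target voice style information

style_info = encode_style(torch.randn(1, 300, 80))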
In accordance with one or more embodiments of the present disclosure, the present disclosure provides an apparatus for generating audio, the apparatus comprising: an acquisition unit configured to acquire target Mandarin text information and timbre information of user voice audio uttered by a target user; a conversion unit configured to convert the target Mandarin text information into corresponding target dialect text information; and a generation unit configured to generate target voice audio based on the target dialect text information and the timbre information of the user voice audio, wherein the timbre of the target voice audio matches the timbre information of the user voice audio, and the target dialect text information indicates the text information corresponding to the target voice audio.
In accordance with one or more embodiments of the present disclosure, in the apparatus for generating audio provided by the present disclosure, the generation unit includes: a first generation subunit configured to extract text feature information of the target dialect text information; a second generation subunit configured to input the text feature information into a pre-trained encoder to obtain encoded text feature information; a third generation subunit configured to input the encoded text feature information and the timbre information of the user voice audio into a pre-trained decoder to obtain Mel spectrum information; and a fourth generation subunit configured to input the Mel spectrum information into a vocoder to obtain the target voice audio.
In accordance with one or more embodiments of the present disclosure, in the apparatus for generating audio provided by the present disclosure, the pre-trained encoder and the pre-trained decoder are trained by: acquiring audio samples provided by different users and annotated with Mel spectrum information; inputting the audio samples into an encoder to be trained to obtain encoded audio samples; inputting the encoded audio samples into a text feature information classifier and a timbre information classifier, respectively, to obtain classified text feature information and classified timbre information; inputting the classified text feature information and the classified timbre information into a decoder to be trained to obtain predicted Mel spectrum information; and adjusting parameters of the encoder and the decoder according to the deviation between the annotated Mel spectrum information and the predicted Mel spectrum information until the deviation satisfies a preset condition, thereby obtaining the trained encoder and decoder.
In accordance with one or more embodiments of the present disclosure, in the apparatus for generating audio provided by the present disclosure, the timbre information of the user voice audio is obtained based on audio data provided by the target user and a pre-trained timbre encoder.
In accordance with one or more embodiments of the present disclosure, in the apparatus for generating audio provided by the present disclosure, the generation unit is further configured to generate the target voice audio based on the target dialect text information, the timbre information of the user voice audio, and target voice style information, the target voice style information indicating the style of the target voice audio.
In accordance with one or more embodiments of the present disclosure, in the apparatus for generating audio provided by the present disclosure, generating the target voice audio based on the target dialect text information, the timbre information of the user voice audio, and the target voice style information includes: extracting text feature information of the target dialect text information; inputting the text feature information into a pre-trained encoder to obtain encoded text feature information; inputting the encoded text feature information, the timbre information of the user voice audio, and the target voice style information into a pre-trained decoder to obtain Mel spectrum information; and inputting the Mel spectrum information into a vocoder to obtain the target voice audio.
In accordance with one or more embodiments of the present disclosure, in the apparatus for generating audio provided by the present disclosure, the target voice style information is obtained by: acquiring voice audio of a person having the voice style indicated by the target voice style information; and inputting the person's voice audio into a pre-trained voice style encoder to generate the target voice style information.
The units involved in the embodiments described in the present disclosure may be implemented by software or by hardware. The described units may also be provided in a processor, which may, for example, be described as: a processor comprising an acquisition unit and a generation unit. In some cases, the names of these units do not limit the units themselves; for example, the acquisition unit may also be described as "a unit that acquires target Mandarin text information and timbre information of user voice audio uttered by a target user".
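For illustration only, that unit decomposition can be mirrored as plain Python classes hosted by one processor object; every name and interface below is hypothetical.

class AcquisitionUnit:
    def run(self, request: dict):
        # Acquire the Mandarin text and the timbre information from a request.
        return request["mandarin_text"], request["timbre"]

class ConversionUnit:
    def run(self, mandarin_text: str) -> str:
        return mandarin_text  # placeholder for Mandarin-to-dialect conversion

class GenerationUnit:
    def run(self, dialect_text: str, timbre) -> bytes:
        return b""  # placeholder for the encoder/decoder/vocoder pipeline

class Processor:
    def __init__(self):
        self.acquisition = AcquisitionUnit()
        self.conversion = ConversionUnit()
        self.generation = GenerationUnit()

    def handle(self, request: dict) -> bytes:
        text, timbre = self.acquisition.run(request)
        return self.generation.run(self.conversion.run(text), timbre)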
As another aspect, the present disclosure also provides a computer-readable medium, which may be included in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire target Mandarin text information and timbre information of user voice audio uttered by a target user; convert the target Mandarin text information into corresponding target dialect text information; and generate target voice audio based on the target dialect text information and the timbre information of the user voice audio, wherein the timbre of the target voice audio matches the timbre information of the user voice audio, and the target dialect text information indicates the text information corresponding to the target voice audio.
The foregoing description is merely a description of the preferred embodiments of the present disclosure and of the technical principles employed. It will be appreciated by those skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the specific combination of the features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the present disclosure.

Claims (10)

1. A method for generating audio, comprising:
acquiring target Mandarin text information and timbre information of user voice audio uttered by a target user;
converting the target Mandarin text information into target dialect text information corresponding to the target Mandarin text information; and
generating target voice audio based on the target dialect text information and the timbre information of the user voice audio, wherein the timbre of the target voice audio matches the timbre information of the user voice audio, and the target dialect text information indicates text information corresponding to the target voice audio;
wherein the target voice audio is generated based on a pre-trained encoder and a pre-trained decoder, the pre-trained encoder comprising a pre-trained text encoder configured to encode text feature information of the target dialect text information, and the pre-trained decoder being configured to decode the encoded text feature information and the timbre information of the user voice audio into information for obtaining the target voice audio;
and wherein training samples for the pre-trained encoder and the pre-trained decoder are audio samples provided by different users and annotated with the information for obtaining the target voice audio; the training samples are input into an audio encoder to be trained, which outputs encoded audio samples comprising timbre information and text feature information; the text feature information is used to train the text encoder to be trained, and the encoded audio samples are used to train the decoder to be trained.
2. The method of claim 1, wherein generating the target voice audio based on the target dialect text information and the timbre information of the user voice audio comprises:
extracting text feature information of the target dialect text information;
inputting the text feature information into the pre-trained text encoder to obtain encoded text feature information;
inputting the encoded text feature information and the timbre information of the user voice audio into the pre-trained decoder to obtain Mel spectrum information; and
inputting the Mel spectrum information into a vocoder to obtain the target voice audio.
3. The method of claim 2, wherein the pre-trained encoder further comprises a timbre information classifier, and the pre-trained encoder and the pre-trained decoder are trained by:
acquiring audio samples provided by different users and annotated with Mel spectrum information;
inputting the audio samples into an audio encoder to be trained to obtain encoded audio samples, wherein the encoded audio samples comprise timbre information and text feature information;
inputting the encoded audio samples into the text encoder to be trained and the timbre information classifier, respectively, to obtain encoded text feature information and classified timbre information;
inputting the encoded text feature information and the classified timbre information into a decoder to be trained to obtain predicted Mel spectrum information; and
adjusting parameters of the timbre information classifier and the decoder to be trained according to the deviation between the annotated Mel spectrum information and the predicted Mel spectrum information until the deviation satisfies a preset condition, thereby obtaining the pre-trained decoder.
4. The method of claim 1, wherein the timbre information of the user voice audio is obtained based on the user voice audio provided by the target user and a pre-trained timbre encoder.
5. The method of claim 1, wherein generating the target voice audio based on the target dialect text information and the timbre information of the user voice audio comprises:
generating the target voice audio based on the target dialect text information, the timbre information of the user voice audio, and target voice style information, wherein the target voice style information indicates the style of the target voice audio.
6. The method of claim 5, wherein generating the target voice audio based on the target dialect text information, the timbre information of the user voice audio, and the target voice style information comprises:
extracting text feature information of the target dialect text information;
inputting the text feature information into the pre-trained text encoder to obtain encoded text feature information;
inputting the encoded text feature information, the timbre information of the user voice audio, and the target voice style information into the pre-trained decoder to obtain Mel spectrum information; and
inputting the Mel spectrum information into a vocoder to obtain the target voice audio.
7. The method of claim 5, wherein the target voice style information is obtained by:
acquiring voice audio of a person having the voice style indicated by the target voice style information; and
inputting the person's voice audio into a pre-trained voice style encoder to generate the target voice style information.
8. An apparatus for generating audio, comprising:
an acquisition unit configured to acquire target Mandarin text information and timbre information of user voice audio uttered by a target user;
a conversion unit configured to convert the target Mandarin text information into corresponding target dialect text information; and
a generation unit configured to generate target voice audio based on the target dialect text information and the timbre information of the user voice audio, wherein the timbre of the target voice audio matches the timbre information of the user voice audio, and the target dialect text information indicates the text information corresponding to the target voice audio;
wherein the target voice audio is generated based on a pre-trained encoder and a pre-trained decoder, the pre-trained encoder comprising a pre-trained text encoder configured to encode text feature information of the target dialect text information, and the pre-trained decoder being configured to decode the encoded text feature information and the timbre information of the user voice audio into information for obtaining the target voice audio;
and wherein training samples for the pre-trained encoder and the pre-trained decoder are audio samples provided by different users and annotated with the information for obtaining the target voice audio; the training samples are input into an audio encoder to be trained, which outputs encoded audio samples comprising timbre information and text feature information; the text feature information is used to train the text encoder to be trained, and the encoded audio samples are used to train the decoder to be trained.
9. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A computer-readable medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method of any one of claims 1-7.
CN202010751729.8A 2020-07-30 2020-07-30 Method, apparatus, device and medium for generating audio Active CN111899719B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010751729.8A CN111899719B (en) 2020-07-30 2020-07-30 Method, apparatus, device and medium for generating audio


Publications (2)

Publication Number Publication Date
CN111899719A CN111899719A (en) 2020-11-06
CN111899719B (en) 2024-07-05

Family

ID=73182558

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010751729.8A Active CN111899719B (en) 2020-07-30 2020-07-30 Method, apparatus, device and medium for generating audio

Country Status (1)

Country Link
CN (1) CN111899719B (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112365881A (en) * 2020-11-11 2021-02-12 北京百度网讯科技有限公司 Speech synthesis method, and training method, device, equipment and medium of corresponding model
CN112382270A (en) * 2020-11-13 2021-02-19 北京有竹居网络技术有限公司 Speech synthesis method, apparatus, device and storage medium
CN112382267A (en) * 2020-11-13 2021-02-19 北京有竹居网络技术有限公司 Method, apparatus, device and storage medium for converting accents
CN112365874B (en) * 2020-11-17 2021-10-26 北京百度网讯科技有限公司 Attribute registration of speech synthesis model, apparatus, electronic device, and medium
CN112364144B (en) * 2020-11-26 2024-03-01 北京汇钧科技有限公司 Interaction method, device, equipment and computer readable medium
CN112365877A (en) * 2020-11-27 2021-02-12 北京百度网讯科技有限公司 Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN112652309A (en) * 2020-12-21 2021-04-13 科大讯飞股份有限公司 Dialect voice conversion method, device, equipment and storage medium
WO2022133630A1 (en) * 2020-12-21 2022-06-30 深圳市优必选科技股份有限公司 Cross-language audio conversion method, computer device and storage medium
CN112599113B (en) * 2020-12-30 2024-01-30 北京大米科技有限公司 Dialect voice synthesis method, device, electronic equipment and readable storage medium
CN112712813B (en) * 2021-03-26 2021-07-20 北京达佳互联信息技术有限公司 Voice processing method, device, equipment and storage medium
CN113223542B (en) * 2021-04-26 2024-04-12 北京搜狗科技发展有限公司 Audio conversion method and device, storage medium and electronic equipment
CN113205793B (en) * 2021-04-30 2022-05-31 北京有竹居网络技术有限公司 Audio generation method and device, storage medium and electronic equipment
CN113409764B (en) * 2021-06-11 2024-04-26 北京搜狗科技发展有限公司 Speech synthesis method and device for speech synthesis
CN113409765B (en) * 2021-06-11 2024-04-26 北京搜狗科技发展有限公司 Speech synthesis method and device for speech synthesis
CN113470664B (en) * 2021-06-30 2024-01-30 平安科技(深圳)有限公司 Voice conversion method, device, equipment and storage medium
CN113628629A (en) * 2021-07-29 2021-11-09 深圳华维教育科技有限公司 Memory module with tone re-engraving function and re-engraving method thereof
CN113628610B (en) * 2021-08-12 2024-02-13 科大讯飞股份有限公司 Voice synthesis method and device and electronic equipment
CN118173110A (en) * 2022-12-08 2024-06-11 抖音视界有限公司 Voice processing method and device and electronic equipment
CN117746834B (en) * 2024-02-21 2024-05-24 青岛海尔科技有限公司 Voice generation method and device based on large model, storage medium and electronic device
CN117953854A (en) * 2024-03-05 2024-04-30 暗物智能科技(广州)有限公司 Multi-dialect voice synthesis method and device, electronic equipment and readable storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1815551A (en) * 2006-02-28 2006-08-09 安徽中科大讯飞信息科技有限公司 Method for conducting text dialect treatment for dialect voice synthesizing system
CN107705783A (en) * 2017-11-27 2018-02-16 北京搜狗科技发展有限公司 A kind of phoneme synthesizing method and device

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI413105B (en) * 2010-12-30 2013-10-21 Ind Tech Res Inst Multi-lingual text-to-speech synthesis system and method
CN103065620B (en) * 2012-12-27 2015-01-14 安徽科大讯飞信息科技股份有限公司 Method with which text input by user is received on mobile phone or webpage and synthetized to personalized voice in real time
US9865251B2 (en) * 2015-07-21 2018-01-09 Asustek Computer Inc. Text-to-speech method and multi-lingual speech synthesizer using the method
US11605371B2 (en) * 2018-06-19 2023-03-14 Georgetown University Method and system for parametric speech synthesis
CN109859737A (en) * 2019-03-28 2019-06-07 深圳市升弘创新科技有限公司 Communication encryption method, system and computer readable storage medium
CN110491367B (en) * 2019-08-16 2022-04-12 东方明珠新媒体股份有限公司 Voice conversion method and device of smart television
CN110880330A (en) * 2019-10-28 2020-03-13 维沃移动通信有限公司 Audio conversion method and terminal equipment
CN110827826B (en) * 2019-11-22 2022-05-31 维沃移动通信有限公司 Method for converting words by voice and electronic equipment
CN111402842B (en) * 2020-03-20 2021-11-19 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for generating audio


Also Published As

Publication number Publication date
CN111899719A (en) 2020-11-06

Similar Documents

Publication Publication Date Title
CN111899719B (en) Method, apparatus, device and medium for generating audio
CN111899720B (en) Method, apparatus, device and medium for generating audio
CN107945786B (en) Speech synthesis method and device
CN111933110B (en) Video generation method, generation model training method, device, medium and equipment
CN112786006B (en) Speech synthesis method, synthesis model training method, device, medium and equipment
CN111583900B (en) Song synthesis method and device, readable medium and electronic equipment
CN112786011B (en) Speech synthesis method, synthesis model training method, device, medium and equipment
CN111369967B (en) Virtual character-based voice synthesis method, device, medium and equipment
CN111369971B (en) Speech synthesis method, device, storage medium and electronic equipment
WO2022105861A1 (en) Method and apparatus for recognizing voice, electronic device and medium
CN111292720A (en) Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment
CN111402842B (en) Method, apparatus, device and medium for generating audio
KR20210103002A (en) Speech synthesis method and apparatus based on emotion information
CN110197655B (en) Method and apparatus for synthesizing speech
CN112786007A (en) Speech synthesis method, device, readable medium and electronic equipment
CN112309366A (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN111292719A (en) Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment
CN112489621A (en) Speech synthesis method, device, readable medium and electronic equipment
CN113205793B (en) Audio generation method and device, storage medium and electronic equipment
CN112786008A (en) Speech synthesis method, device, readable medium and electronic equipment
CN113327580A (en) Speech synthesis method, device, readable medium and electronic equipment
CN112309367B (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
WO2023160553A1 (en) Speech synthesis method and apparatus, and computer-readable medium and electronic device
CN112331176A (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN112927674A (en) Voice style migration method and device, readable medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant