CN112652292A - Method, apparatus, device and medium for generating audio - Google Patents


Info

Publication number: CN112652292A
Authority: CN (China)
Prior art keywords: information, target, age group, audio, acoustic characteristic
Prior art date
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202011272812.3A
Other languages: Chinese (zh)
Inventor: 汤本来
Current Assignee: Beijing Youzhuju Network Technology Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Beijing Youzhuju Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd
Priority to CN202011272812.3A
Publication of CN112652292A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique

Abstract

Embodiments of the present disclosure disclose methods, apparatuses, devices and media for generating audio. One embodiment of the method for generating audio comprises: acquiring acoustic feature information of a source speaker, target age group tag information and target voice timbre information; and generating target voice audio based on the acoustic feature information of the source speaker, the target age group tag information and the target voice timbre information, wherein the target age group tag information indicates the age group to which the target voice audio belongs, and the timbre of the target voice audio matches the target voice timbre information. This embodiment can convert the acoustic feature information of the source speaker into voice audio having the target age group tag information and the target voice timbre information, thereby enabling switching of both the age group to which the voice audio belongs and its timbre, and enriching the ways in which voice audio can be generated.

Description

Method, apparatus, device and medium for generating audio
Technical Field
Embodiments of the present disclosure relate to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a medium for generating audio.
Background
In recent years, due to the rapid development of short videos and digital entertainment media, there has been a great deal of attention and research on the conversion of source speaker speech into speech of designated speakers of different ages.
The existing method for changing the voice of a source speaker mainly relies on spectrum shifting: the speech signal is converted into a frequency-domain signal, the whole spectrum is then shifted toward a higher frequency range in the frequency domain, and the shifted spectrum is finally converted back into the time domain to accomplish the voice change.
Disclosure of Invention
The present disclosure presents methods, apparatuses, devices and media for generating audio.
In a first aspect, an embodiment of the present disclosure provides a method for generating audio, the method including: acquiring acoustic feature information of a source speaker, target age group tag information and target voice timbre information; and generating target voice audio based on the acoustic feature information of the source speaker, the target age group tag information and the target voice timbre information, wherein the target age group tag information indicates the age group to which the target voice audio belongs, and the timbre of the target voice audio matches the target voice timbre information.
In some embodiments, generating the target voice audio based on the acoustic feature information of the source speaker, the target age group tag information and the target voice timbre information comprises: inputting the acoustic feature information of the source speaker into the encoder of a pre-trained generative adversarial network to obtain encoded acoustic feature information; inputting the encoded acoustic feature information, the target age group tag information and the target voice timbre information into the decoder of the pre-trained generative adversarial network to obtain target acoustic feature information; and inputting the target acoustic feature information into a vocoder to obtain the target voice audio.
In some embodiments, the encoder and the decoder are trained by: acquiring acoustic feature information samples provided by different users, each annotated with acoustic feature information carrying only that user's voice timbre information; inputting an acoustic feature information sample into the encoder to be trained to obtain an encoded acoustic feature information sample; inputting the encoded acoustic feature information sample into the decoder to obtain predicted acoustic feature information; training the encoder and the decoder of the generative network based on the predicted acoustic feature information, the acoustic feature information that corresponds to the input acoustic feature information sample and carries the expected age group tag information, and the discriminator of the generative adversarial network, to obtain a preliminarily trained encoder and decoder; and adjusting the parameters of the preliminarily trained encoder and decoder according to the deviation between the annotated acoustic feature information and the predicted acoustic feature information until the deviation satisfies a preset condition, to obtain the trained encoder and decoder.
In some embodiments, training the encoder and the decoder of the generative network based on the predicted acoustic feature information, the acoustic feature information that corresponds to the input acoustic feature information sample and carries the expected age group tag information, and the discriminator of the generative adversarial network, to obtain the preliminarily trained encoder and decoder, comprises: inputting the predicted acoustic feature information into an age group information classifier to obtain classified acoustic feature information; and inputting the classified acoustic feature information, together with the acoustic feature information that corresponds to the input acoustic feature information sample and carries the expected age group tag information, into the discriminator to train the encoder and the decoder of the generative network, to obtain the preliminarily trained encoder and decoder.
In some embodiments, generating the target voice audio based on the acoustic feature information of the source speaker, the target age group tag information and the target voice timbre information includes: generating the target voice audio based on the acoustic feature information of the source speaker, the age group tag information of the source speaker, the target age group tag information and the target voice timbre information, wherein the age group tag information of the source speaker indicates the age group to which the acoustic feature information of the source speaker belongs.
In some embodiments, the gender of the different users is the same as the gender of the source speaker.
In some embodiments, the target voice timbre information is obtained by: acquiring voice audio of a person having the voice timbre indicated by the target voice timbre information; and inputting the voice audio of the person into a pre-trained voice timbre encoder to generate the target voice timbre information.
In some embodiments, the acoustic feature information is mel-frequency spectrum information.
In a second aspect, an embodiment of the present disclosure provides an apparatus for generating audio, the apparatus comprising: an acquisition unit configured to acquire acoustic feature information of a source speaker, target age group tag information and target voice timbre information; and a generating unit configured to generate target voice audio based on the acoustic feature information of the source speaker, the target age group tag information and the target voice timbre information, wherein the target age group tag information indicates the age group to which the target voice audio belongs and the timbre of the target voice audio matches the target voice timbre information.
In a third aspect, embodiments of the present disclosure provide an electronic device for generating audio, including: one or more processors; a storage device having one or more programs stored thereon, which when executed by the one or more processors, cause the one or more processors to implement the method of any of the embodiments of the method for generating audio as described above.
In a fourth aspect, embodiments of the present disclosure provide a computer-readable medium for generating audio, on which a computer program is stored, which when executed by a processor, implements the method of any of the embodiments of the method for generating audio as described above.
The method, apparatus, device and medium for generating audio provided by the embodiments of the present disclosure acquire acoustic feature information of a source speaker, target age group tag information and target voice timbre information, and generate target voice audio based on them, wherein the target age group tag information indicates the age group to which the target voice audio belongs and the timbre of the target voice audio matches the target voice timbre information. In this way, the acoustic feature information of the source speaker can be converted into voice audio having the specified target age group tag information and target voice timbre information, which enables switching of both the age group and the timbre of the voice audio and enriches the ways in which voice audio can be generated.
Drawings
Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;
FIG. 2 is a flow diagram of one embodiment of a method for generating audio according to the present disclosure;
FIG. 3 is a schematic diagram of one application scenario of a method for generating audio according to the present disclosure;
FIG. 4 is a flow diagram of yet another embodiment of a method for generating audio according to the present disclosure;
FIG. 5 is a schematic block diagram illustrating one embodiment of an apparatus for generating audio according to the present disclosure;
FIG. 6 is a schematic block diagram of a computer system suitable for use with an electronic device implementing embodiments of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the disclosure and are not limiting of the disclosure. It should be further noted that, for the convenience of description, only the portions relevant to the present disclosure are shown in the drawings.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 of an embodiment of a method for generating audio or an apparatus for generating audio to which embodiments of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or transmit data (e.g., acoustic feature information of the source speaker, target age group tag information, and timbre information of the user's voice audio uttered by the target user), etc. The terminal devices 101, 102, 103 may have various client applications installed thereon, such as audio playing software, music processing applications, news information applications, image processing applications, web browser applications, shopping applications, search applications, instant messaging tools, mailbox clients, social platform software, and the like.
The terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices having information processing functions, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like. When the terminal apparatuses 101, 102, 103 are software, they can be installed in the electronic apparatuses listed above. It may be implemented as a plurality of software or software modules (e.g., software or software modules used to provide a generated audio service) or as a single software or software module. And is not particularly limited herein.
The server 105 may be a server that provides various services, for example a background audio processing server that generates target voice audio based on the acoustic feature information of the source speaker, the target age group tag information and the target voice timbre information transmitted from the terminal devices 101, 102, 103. Optionally, the background audio processing server may further feed the generated target voice audio back to the terminal device, so that the terminal device can play it. As an example, the server 105 may be a cloud server.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as a plurality of software or software modules (e.g., software or software modules for providing a service of generating audio) or as a single software or software module. And is not particularly limited herein.
It should be further noted that the method for generating audio provided by the embodiments of the present disclosure may be executed by a server, may also be executed by a terminal device, and may also be executed by the server and the terminal device in cooperation with each other. Accordingly, the various parts (e.g., the various units, sub-units, modules, and sub-modules) included in the apparatus for generating audio may be all disposed in the server, may be all disposed in the terminal device, and may be disposed in the server and the terminal device, respectively.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. The system architecture may only include the electronic device (e.g., server or terminal device) on which the method for generating audio operates, when the electronic device on which the method for generating audio operates does not require data transfer with other electronic devices.
With continued reference to fig. 2, a flow 200 of one embodiment of a method for generating audio in accordance with the present disclosure is shown. The method for generating audio comprises the following steps:
step 201, obtaining acoustic characteristic information of a source speaker, target age group label information and target voice timbre information.
In the present embodiment, the execution subject of the method for generating audio (e.g., the server 105 or the terminal devices 101, 102, 103 shown in fig. 1) may obtain the acoustic feature information of the source speaker, the target age group tag information and the target voice timbre information from other electronic devices through a wired or wireless connection, or locally.
The acoustic feature information of the source speaker is obtained from the voice audio of the source speaker. Here, the source speaker may be any speaker, and the source speaker's voice audio may be the audio of any speech uttered by the source speaker. For example, it may be the audio of a song sung by the source speaker, or the audio of speech uttered by the source speaker during a conversation.
The acoustic feature information is used to characterize the speech signal to be processed. Specifically, the acoustic feature information may be characterized by linear prediction coefficients, cepstral coefficients, mel-frequency spectrum coefficients, and the like. Linear prediction coefficients start from the human phonation mechanism: by studying a cascaded short-tube model of the vocal tract, the transfer function of the system is taken to have the form of an all-pole digital filter, so that the signal at time n can be estimated as a linear combination of the signals at several previous times. Cepstral coefficients are obtained by homomorphic processing: taking the discrete Fourier transform of the speech signal, taking the logarithm, and then taking the inverse Fourier transform. Mel-frequency spectrum coefficients are acoustic features derived from research on the human auditory system.
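As an illustrative sketch only (not part of the patent text), mel-frequency spectrum features of a source speaker's audio could be extracted as follows; the file name and parameter values are assumptions:

    import librosa
    import numpy as np

    # Load the source speaker's audio (file name is a placeholder).
    waveform, sample_rate = librosa.load("source_speaker.wav", sr=16000)

    # 80-band mel spectrogram, a common acoustic-feature choice for voice conversion.
    mel = librosa.feature.melspectrogram(
        y=waveform, sr=sample_rate, n_fft=1024, hop_length=256, n_mels=80
    )
    # Work in log magnitude, the usual form of "mel-frequency spectrum information".
    log_mel = librosa.power_to_db(mel, ref=np.max)
    print(log_mel.shape)  # (80, number_of_frames)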
Here, the age group information indicated by the target age group tag information may be set according to experience, actual needs, and a specific application scenario, which is not limited in this application.
Specifically, if the age group information is set to 20 to 30 years, 30 to 40 years, and 40 to 50 years, respectively, the age group information indicated by the target age group tag information may be 30 to 40 years.
The target voice timbre information indicates the timbre of the user voice audio uttered by a target user. The execution subject may input voice audio into a pre-trained voice timbre information generation model to obtain the voice timbre information of that voice audio. The voice timbre information generation model can be obtained by training on voice audio samples annotated with voice timbre information.
In some alternative ways, the target voice timbre information is obtained by: acquiring voice audio of a person having the voice timbre indicated by the target voice timbre information; and inputting the voice audio of the person into a pre-trained voice timbre encoder to generate the target voice timbre information.
In this implementation, the execution subject may obtain the voice audio of the person having the timbre indicated by the target voice timbre information, and input the voice audio of the person into the pre-trained voice timbre encoder to obtain the target voice timbre information.
The voice timbre encoder is used to capture the timbre characteristics of the input voice audio; these timbre characteristics are independent of the textual content and the style characteristics of the voice audio, and the output of the pre-trained timbre encoder can take the form of an embedding vector.
For example, if the target voice audio is to have the timbre of a certain speaker, the timbre information of that speaker may be used as the target voice timbre information: the voice audio of that speaker is obtained and input into the pre-trained voice timbre encoder to obtain the target voice timbre information.
In this implementation, obtaining the voice audio of a person having the voice timbre indicated by the target voice timbre information and inputting it into the pre-trained voice timbre encoder to generate the target voice timbre information allows the timbre characteristics of the voice audio to be captured more faithfully, which improves the accuracy of the obtained target voice timbre information.
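A minimal sketch of such a timbre encoder, assuming a PyTorch implementation that summarizes frame-level features into a fixed-size speaker embedding (the layer choices, dimensions and names are assumptions, not taken from the patent):

    import torch
    import torch.nn as nn

    class TimbreEncoder(nn.Module):
        """Maps a mel spectrogram (batch, n_mels, frames) to a fixed-size timbre embedding."""

        def __init__(self, n_mels: int = 80, embed_dim: int = 256):
            super().__init__()
            self.rnn = nn.GRU(input_size=n_mels, hidden_size=embed_dim, batch_first=True)
            self.proj = nn.Linear(embed_dim, embed_dim)

        def forward(self, mel: torch.Tensor) -> torch.Tensor:
            # GRU expects (batch, frames, n_mels).
            _, hidden = self.rnn(mel.transpose(1, 2))
            # Use the final hidden state as an utterance-level summary of timbre.
            embedding = self.proj(hidden[-1])
            # L2-normalise so embeddings of the same speaker cluster on a unit sphere.
            return torch.nn.functional.normalize(embedding, dim=-1)

    # Usage: timbre = TimbreEncoder()(torch.randn(1, 80, 200))  # -> shape (1, 256)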
In some optional ways, the acoustic feature information is mel-frequency spectrum information.
In this implementation, the acoustic feature information is mel-frequency spectrum information; that is, the acoustic feature information can be characterized by mel-frequency spectrum coefficients or mel-frequency cepstral coefficients.
Using mel-frequency spectrum information as the acoustic feature information effectively reduces noise interference in the speaker's voice and captures the speaker's voice characteristics more fully.
Step 202, generating a target voice audio based on the acoustic feature information of the source speaker, the target age group tag information and the target voice timbre information.
In this embodiment, the execution body may generate the target voice audio from the acoustic feature information of the source speaker, the target age group tag information, and the target voice timbre information. The target age group label information is used for indicating age group information to which the target voice audio belongs.
As an example, the execution subject may input the acoustic feature information of the source speaker, the target age group tag information, and the target voice timbre information to a pre-trained audio generation model to generate the target voice audio. The audio generation model can be obtained by training based on acoustic characteristic information sample data marked with expected age label information and target voice timbre information.
Specifically, the execution subject may first obtain the voice audio of a conversation of the source speaker, perform feature extraction on it to obtain the acoustic feature information of the source speaker, and then obtain the target age group tag information and the target voice timbre information. Suppose the acoustic feature information of the source speaker indicates the age group of 5 to 15 years, the target age group tag information indicates the age group of 45 to 55 years, and the target voice timbre information indicates the timbre of voice audio uttered by a target user K. If the execution subject inputs the acoustic feature information of the source speaker, the target age group tag information and the target voice timbre information into the pre-trained audio generation model, the output is voice audio belonging to the age group indicated by the target age group tag information (45 to 55 years) and having the timbre of user K, so that both the age group and the timbre of the source speaker's voice audio are switched.
With continued reference to fig. 3, fig. 3 is a schematic diagram of one application scenario of the method for generating audio according to the present embodiment. In the application scenario of fig. 3, the execution subject 301 first obtains the acoustic feature information 302 of a source speaker, target age group tag information 303 and target voice timbre information 304, where the source speaker belongs to the age group of 50 to 60 years, the target age group tag information 303 indicates the age group of 15 to 25 years, and the target voice timbre information 304 is the timbre information of voice audio uttered by a target user M. The execution subject then generates target voice audio 305 based on the acoustic feature information 302 of the source speaker, the target age group tag information 303 and the target voice timbre information 304: the target voice audio 305 belongs to the age group of 15 to 25 years indicated by the target age group tag information 303, and its timbre matches the target voice timbre information, i.e. the target voice audio 305 has the timbre of user M.
The method provided by the above embodiment of the present disclosure obtains the acoustic feature information of a source speaker, target age group tag information and target voice timbre information, and generates target voice audio based on them, wherein the target age group tag information indicates the age group to which the target voice audio belongs and the timbre of the target voice audio matches the target voice timbre information. The acoustic feature information of the source speaker can thus be converted into voice audio with the specified age group and timbre, which enables switching of both the age group and the timbre of the voice audio and enriches the ways in which voice audio can be generated.
With further reference to fig. 4, a flow 400 of yet another embodiment of a method for generating audio is shown. The flow 400 of the method for generating audio comprises the steps of:
step 401, obtaining acoustic feature information of a source speaker, target age group tag information and target voice timbre information.
In this embodiment, step 401 is substantially the same as step 201 in the corresponding embodiment of fig. 2, and is not described here again.
Step 402, inputting the acoustic feature information of the source speaker into the encoder of a pre-trained generative adversarial network to obtain encoded acoustic feature information.
In this embodiment, the encoder is used to encode the acoustic feature information of the source speaker. The encoder may be implemented based on neural networks in the prior art or developed in the future, for example a CNN (Convolutional Neural Network), an LSTM (Long Short-Term Memory network), a GRU (Gated Recurrent Unit), a BGRU (Bidirectional Gated Recurrent Unit), and the like, which is not limited in this application.
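For illustration only, a content encoder of this kind could be sketched in PyTorch as below; the choice of a 1-D convolution followed by a bidirectional GRU, and all dimensions, are assumptions rather than details specified by the patent:

    import torch
    import torch.nn as nn

    class ContentEncoder(nn.Module):
        """Encodes a mel spectrogram (batch, n_mels, frames) into a frame-level representation."""

        def __init__(self, n_mels: int = 80, hidden: int = 256):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2),
                nn.ReLU(),
                nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
                nn.ReLU(),
            )
            self.rnn = nn.GRU(hidden, hidden // 2, batch_first=True, bidirectional=True)

        def forward(self, mel: torch.Tensor) -> torch.Tensor:
            x = self.conv(mel)                    # (batch, hidden, frames)
            out, _ = self.rnn(x.transpose(1, 2))  # (batch, frames, hidden)
            return out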
Step 403, inputting the encoded acoustic feature information, the target age group tag information and the target voice timbre information into the decoder of the pre-trained generative adversarial network to obtain target acoustic feature information.
In this embodiment, the execution subject may input the encoded acoustic feature information, the target age group tag information, and the target speech timbre information to a pre-trained decoder to obtain the target acoustic feature information.
The target age group label information is information obtained by encoding the target age group label.
Here, the target age group tag may be encoded in various ways, for example as a Gray code or a one-hot code.
In addition, the pre-trained decoder may be autoregressive or non-autoregressive, which is not limited in this application. Preferably, the decoder is autoregressive: compared with decoders of other forms, an autoregressive decoder can better exploit the dependencies of voice audio across different time scales and thus improve the generation quality of the target voice audio.
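As a sketch of how the decoder's conditioning inputs might be assembled (the age buckets, dimensions and the simple non-autoregressive GRU decoder are illustrative assumptions; the preferred autoregressive form would additionally feed each predicted frame back into the decoder):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    AGE_GROUPS = 5  # e.g. buckets such as 5-15, 15-25, 25-35, 35-45, 45-55 years (assumed)

    class Decoder(nn.Module):
        """Decodes encoded features conditioned on an age-group tag and a timbre embedding."""

        def __init__(self, enc_dim: int = 256, timbre_dim: int = 256, n_mels: int = 80):
            super().__init__()
            in_dim = enc_dim + AGE_GROUPS + timbre_dim
            self.rnn = nn.GRU(in_dim, 512, batch_first=True)
            self.out = nn.Linear(512, n_mels)

        def forward(self, encoded, age_group_id, timbre):
            # encoded: (batch, frames, enc_dim); timbre: (batch, timbre_dim)
            frames = encoded.size(1)
            age = F.one_hot(age_group_id, AGE_GROUPS).float()   # one-hot age-group tag
            cond = torch.cat([age, timbre], dim=-1)
            cond = cond.unsqueeze(1).expand(-1, frames, -1)     # broadcast to every frame
            hidden, _ = self.rnn(torch.cat([encoded, cond], dim=-1))
            return self.out(hidden)                             # predicted target mel frames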
Step 404, inputting the target acoustic feature information into the vocoder to obtain the target voice audio.
In this embodiment, the execution subject inputs the target acoustic feature information obtained in the above steps into the vocoder to obtain the target voice audio. The vocoder characterizes the correspondence between acoustic feature information and voice audio.
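The patent does not fix a particular vocoder. As a minimal stand-in for illustration, a mel spectrogram can be inverted back to a waveform with Griffin-Lim; a trained neural vocoder would normally be used instead for more natural audio. The parameter values and file name below are assumptions and must match those used for feature extraction:

    import librosa
    import soundfile as sf

    def mel_to_waveform(mel_power, sample_rate=16000):
        # Griffin-Lim based inversion of a (n_mels, frames) power mel spectrogram.
        return librosa.feature.inverse.mel_to_audio(
            mel_power, sr=sample_rate, n_fft=1024, hop_length=256
        )

    # waveform = mel_to_waveform(target_mel)          # target_mel comes from the decoder
    # sf.write("target_voice.wav", waveform, 16000)   # output file name is a placeholder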
In some alternatives, the encoder and the decoder are trained by: acquiring acoustic feature information samples provided by different users, each annotated with acoustic feature information carrying only that user's voice timbre information; inputting an acoustic feature information sample into the encoder to be trained to obtain an encoded acoustic feature information sample; inputting the encoded acoustic feature information sample into the decoder to be trained to obtain predicted acoustic feature information; training the encoder and the decoder of the generative network based on the predicted acoustic feature information, the acoustic feature information that corresponds to the input sample and carries the expected age group tag information, and the discriminator, to obtain a preliminarily trained encoder and decoder; and adjusting the parameters of the preliminarily trained encoder and decoder according to the deviation between the annotated acoustic feature information and the predicted acoustic feature information until the deviation satisfies a preset condition, to obtain the trained encoder and decoder.
In this implementation, the pre-trained encoder and decoder are trained as follows. First, acoustic feature information samples provided by different users and annotated with acoustic feature information carrying the users' voice timbre information are obtained. Here, a sample provided by a user may include acoustic feature information corresponding to conversational voice audio uttered by the user, acoustic feature information corresponding to singing voice audio uttered by the user, and the like.
Then, the acoustic feature information sample is input into the encoder to be trained to obtain an encoded acoustic feature information sample, and the encoded sample is input into the decoder to be trained to obtain predicted acoustic feature information.
Further, the execution subject may either input the predicted acoustic feature information and the acoustic feature information that corresponds to the input sample and carries the expected age group tag information directly into the pre-trained discriminator, or first classify the predicted acoustic feature information output by the decoder by age group and then input the classified acoustic feature information together with the acoustic feature information carrying the expected age group tag information into the discriminator. In either case the encoder and the decoder are trained according to the output of the discriminator to obtain the preliminarily trained encoder and decoder, and this application does not limit which variant is used.
Finally, the execution subject adjusts the parameters of the preliminarily trained encoder and decoder according to the deviation between the annotated acoustic feature information and the predicted acoustic feature information until the deviation satisfies the preset condition, at which point the trained encoder and decoder are obtained.
By training the encoder and the decoder on acoustic feature information samples provided by different users, this implementation lets the trained encoder and decoder learn the age group characteristics and the voice timbre characteristics of different acoustic feature information samples, which helps improve the generalization ability of the trained encoder and decoder.
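A highly simplified sketch of one adversarial training step under this scheme; the specific losses, optimizers and binary-cross-entropy formulation are illustrative assumptions, since the patent only specifies that a discriminator and a deviation-based parameter adjustment are used:

    import torch
    import torch.nn.functional as F

    def train_step(encoder, decoder, discriminator, opt_gen, opt_disc,
                   source_mel, target_mel, target_age_id, timbre):
        # source_mel: (batch, n_mels, frames); target_mel: annotated features carrying
        # the expected age-group tag, laid out as (batch, frames, n_mels).
        predicted_mel = decoder(encoder(source_mel), target_age_id, timbre)

        # 1. Discriminator update: real annotated features vs. predicted features.
        opt_disc.zero_grad()
        d_real = discriminator(target_mel)
        d_fake = discriminator(predicted_mel.detach())
        loss_d = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
                 F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
        loss_d.backward()
        opt_disc.step()

        # 2. Encoder/decoder update: fool the discriminator while staying close to the
        #    annotated features (the "deviation" by which parameters are adjusted).
        opt_gen.zero_grad()
        d_fake = discriminator(predicted_mel)
        loss_adv = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
        loss_rec = F.l1_loss(predicted_mel, target_mel)
        (loss_adv + loss_rec).backward()
        opt_gen.step()
        return loss_d.item(), loss_adv.item(), loss_rec.item()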
In some alternative approaches, the gender of the different users is the same as the gender of the source speaker.
In this implementation, the gender of the different users providing the acoustic feature information samples is the same as the gender of the source speaker. Specifically, if the source speaker is a female aged 10 to 20 years, the users providing the acoustic feature information samples may be females of any age group.
Training the generative network on acoustic feature information samples provided by different users of the same gender as the source speaker improves the accuracy and reliability of the generative network, and thus the accuracy and reliability of the generated target voice audio.
In some alternatives, generating the target speech audio based on the acoustic feature information of the source speaker, the target age group tag information, and the target speech timbre information includes: and generating a target voice audio based on the acoustic characteristic information of the source speaker, the age group label information of the source speaker, the target age group label information and the target voice timbre information.
In this implementation, the execution subject may first input the acoustic feature information of the source speaker into the encoder of the pre-trained generative adversarial network to obtain encoded acoustic feature information; then input the encoded acoustic feature information, the target age group tag information (for example, 25 to 35 years), the age group tag information of the source speaker (for example, 5 to 15 years) and the target voice timbre information into the decoder of the pre-trained generative adversarial network to obtain target acoustic feature information; and finally input the target acoustic feature information into a vocoder to obtain the target voice audio.
The age group tag information of the source speaker is used for indicating age group information to which the acoustic feature information of the source speaker belongs.
By inputting the encoded acoustic feature information, the target age group tag information and the age group tag information of the source speaker into the decoder of the pre-trained generative adversarial network to obtain the target acoustic feature information, the generative network can generate the target voice audio according to the difference between the target age group tag information and the source speaker's age group tag information, which can accelerate the generation of the target voice audio.
In some optional manners, training the encoder and the decoder of the generative network based on the predicted acoustic feature information, the acoustic feature information that corresponds to the input acoustic feature information sample and carries the expected age group tag information, and the discriminator, to obtain the preliminarily trained encoder and decoder, includes: inputting the predicted acoustic feature information into an age group information classifier to obtain classified acoustic feature information; and inputting the classified acoustic feature information, together with the acoustic feature information that corresponds to the input sample and carries the expected age group tag information, into the discriminator to train the encoder and the decoder of the generative network, to obtain the preliminarily trained encoder and decoder.
In this implementation, after obtaining the predicted acoustic feature information output by the decoder, the execution subject may input it into the age group information classifier to obtain classified acoustic feature information, and then input the classified acoustic feature information together with the acoustic feature information carrying the expected age group tag information into the discriminator; the encoder and the decoder are then trained according to the output of the discriminator to obtain the preliminarily trained encoder and decoder.
By passing the predicted acoustic feature information through the age group information classifier and feeding the classified acoustic feature information, together with the acoustic feature information carrying the expected age group tag information, into the discriminator to train the encoder and the decoder of the generative network, this implementation helps accelerate the training of the encoder and the decoder and improves the accuracy of the target voice audio output by the generative network.
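One way this classifier stage could look, as an illustrative sketch (architecture, dimensions and the way the classifier output is attached to the features are assumptions):

    import torch
    import torch.nn as nn

    class AgeGroupClassifier(nn.Module):
        """Predicts an age-group distribution from predicted mel features (batch, frames, n_mels)."""

        def __init__(self, n_mels: int = 80, n_age_groups: int = 5):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(n_mels, 128), nn.ReLU(), nn.Linear(128, n_age_groups))

        def forward(self, mel_frames: torch.Tensor) -> torch.Tensor:
            # Average over time, then classify the utterance-level summary.
            return self.net(mel_frames.mean(dim=1)).softmax(dim=-1)

    # "Classified acoustic feature information" could, for example, be the features tagged
    # with the classifier's output before they reach the discriminator:
    # torch.cat([mel_frames.mean(dim=1), AgeGroupClassifier()(mel_frames)], dim=-1)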
As can be seen from fig. 4, compared with the embodiment corresponding to fig. 2, the flow 400 of the method for generating audio in this embodiment highlights inputting the acoustic feature information of the source speaker into the encoder of the pre-trained generative adversarial network to obtain encoded acoustic feature information, inputting the encoded acoustic feature information, the target age group tag information and the target voice timbre information into the decoder of the pre-trained generative adversarial network to obtain target acoustic feature information, and inputting the target acoustic feature information into a vocoder to obtain the target voice audio. In this way, the target voice audio can carry any specified target age group tag information and any specified target voice timbre information, which increases the flexibility of the generated target voice audio; and because a vocoder is used to generate the target voice audio, the generated audio is closer to real voice audio and the synthesis sounds more natural.
With further reference to fig. 5, as an implementation of the method shown in the above figures, the present disclosure provides an embodiment of an apparatus for generating audio. The apparatus embodiment corresponds to the method embodiment shown in fig. 2 and, in addition to the features described below, may include the same or corresponding features as that method embodiment and produce the same or corresponding effects. The apparatus can be applied to various electronic devices.
As shown in fig. 5, the apparatus 500 for generating audio of the present embodiment includes: an obtaining module 501 configured to obtain acoustic feature information of a source speaker, target age group tag information and target voice timbre information; and a generating module 502 configured to generate target voice audio based on the acoustic feature information of the source speaker, the target age group tag information and the target voice timbre information.
In this embodiment, the generating module 502 may generate the target voice audio based on the acoustic feature information of the source speaker, the target age group tag information and the target voice timbre information acquired by the obtaining module 501. The target age group tag information indicates the age group to which the target voice audio belongs, and the timbre of the target voice audio matches the target voice timbre information.
In some optional implementations of this embodiment, the generating module 502 includes: an encoding unit (not shown in the figure) configured to input the acoustic feature information of the source speaker into the encoder of a pre-trained generative adversarial network to obtain encoded acoustic feature information; a decoding unit (not shown in the figure) configured to input the encoded acoustic feature information, the target age group tag information and the target voice timbre information into the decoder of the pre-trained generative adversarial network to obtain target acoustic feature information; and an obtaining unit (not shown in the figure) configured to input the target acoustic feature information into a vocoder to obtain the target voice audio.
In some optional implementations of this embodiment, the encoder and the decoder are trained by: acquiring acoustic feature information samples provided by different users, each annotated with acoustic feature information carrying only that user's voice timbre information; inputting an acoustic feature information sample into the encoder to be trained to obtain an encoded acoustic feature information sample; inputting the encoded acoustic feature information sample into the decoder to obtain predicted acoustic feature information; training the encoder and the decoder of the generative network based on the predicted acoustic feature information, the acoustic feature information that corresponds to the input sample and carries the expected age group tag information, and the discriminator of the generative adversarial network, to obtain a preliminarily trained encoder and decoder; and adjusting the parameters of the preliminarily trained encoder and decoder according to the deviation between the annotated acoustic feature information and the predicted acoustic feature information until the deviation satisfies a preset condition, to obtain the trained encoder and decoder.
In some optional implementations of this embodiment, training the encoder and the decoder of the generative network based on the predicted acoustic feature information, the acoustic feature information that corresponds to the input acoustic feature information sample and carries the expected age group tag information, and the discriminator of the generative adversarial network, to obtain the preliminarily trained encoder and decoder, includes: inputting the predicted acoustic feature information into an age group information classifier to obtain classified acoustic feature information; and inputting the classified acoustic feature information, together with the acoustic feature information that corresponds to the input sample and carries the expected age group tag information, into the discriminator to train the encoder and the decoder of the generative network, to obtain the preliminarily trained encoder and decoder.
In some optional implementations of this embodiment, the generating module is further configured to generate the target speech audio based on the acoustic feature information of the source speaker, the age group tag information of the source speaker, the target age group tag information, and the target speech timbre information, wherein the age group tag information of the source speaker is used to indicate age group information to which the acoustic feature information of the source speaker belongs.
In some alternative implementations of the present embodiment, the gender of the different users is the same as the gender of the source speaker.
In some optional implementations of this embodiment, the target voice timbre information is obtained by: acquiring voice audio of a person having the voice timbre indicated by the target voice timbre information; and inputting the voice audio of the person into a pre-trained voice timbre encoder to generate the target voice timbre information.
In some optional implementations of this embodiment, the acoustic feature information is mel-frequency spectrum information.
The apparatus provided by the above embodiment of the present disclosure acquires, through the obtaining module 501, the acoustic feature information of a source speaker, target age group tag information and target voice timbre information; the generating module 502 then generates target voice audio based on them, wherein the target age group tag information indicates the age group to which the target voice audio belongs and the timbre of the target voice audio matches the target voice timbre information. The acoustic feature information of the source speaker can thus be converted into voice audio having the target age group tag information and the target voice timbre information, which enables switching of both the age group and the timbre of the voice audio and enriches the ways in which voice audio can be generated.
Referring now to fig. 6, a schematic diagram of an electronic device (e.g., the server or terminal device of fig. 1) 600 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a fixed terminal such as a digital TV, a desktop computer, and the like. The terminal device/server shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 6, electronic device 600 may include a processing means (e.g., central processing unit, graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic device 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 illustrates an electronic device 600 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 6 may represent one device or may represent multiple devices as desired.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Python, Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In accordance with one or more embodiments of the present disclosure, there is provided a method for generating audio, the method comprising: acquiring acoustic feature information of a source speaker, target age group tag information and target voice timbre information; and generating target voice audio based on the acoustic feature information of the source speaker, the target age group tag information and the target voice timbre information, wherein the target age group tag information indicates the age group to which the target voice audio belongs, and the timbre of the target voice audio matches the target voice timbre information.
According to one or more embodiments of the present disclosure, in the method for generating audio provided by the present disclosure, generating the target voice audio based on the acoustic feature information of the source speaker, the target age group tag information and the target voice timbre information includes: inputting the acoustic feature information of the source speaker into the encoder of a pre-trained generative adversarial network to obtain encoded acoustic feature information; inputting the encoded acoustic feature information, the target age group tag information and the target voice timbre information into the decoder of the pre-trained generative adversarial network to obtain target acoustic feature information; and inputting the target acoustic feature information into a vocoder to obtain the target voice audio.
According to one or more embodiments of the present disclosure, the present disclosure provides a method for generating audio in which the encoder and the decoder are trained by: acquiring acoustic feature information samples provided by different users, each annotated with acoustic feature information carrying only that user's voice timbre information; inputting an acoustic feature information sample into the encoder to be trained to obtain an encoded acoustic feature information sample; inputting the encoded acoustic feature information sample into the decoder to obtain predicted acoustic feature information; training the encoder and the decoder of the generative network based on the predicted acoustic feature information, the acoustic feature information that corresponds to the input sample and carries the expected age group tag information, and the discriminator of the generative adversarial network, to obtain a preliminarily trained encoder and decoder; and adjusting the parameters of the preliminarily trained encoder and decoder according to the deviation between the annotated acoustic feature information and the predicted acoustic feature information until the deviation satisfies a preset condition, to obtain the trained encoder and decoder.
According to one or more embodiments of the present disclosure, in a method for generating audio provided by the present disclosure, training the encoder and the decoder of the generator network based on the predicted acoustic feature information, the acoustic feature information that corresponds to the input samples and carries the expected age group tag information, and the discriminator network of the generative adversarial network, to obtain the preliminarily trained encoder and decoder, includes: inputting the predicted acoustic feature information into an age group information classifier to obtain classified acoustic feature information; and inputting the classified acoustic feature information and the acoustic feature information that corresponds to the input samples and carries the expected age group tag information into the discriminator network to train the encoder and the decoder of the generator network, thereby obtaining the preliminarily trained encoder and decoder.
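As a further illustration, the age group information classifier could take a shape like the following; the pooling strategy, layer sizes, and the way the classifier's scores are folded back into a "classified" feature representation are speculative assumptions for this sketch only.

```python
import torch
import torch.nn as nn

class AgeGroupClassifier(nn.Module):
    """Maps predicted Mel features to per-age-group evidence and returns a
    'classified' feature representation for the discriminator (illustrative)."""
    def __init__(self, n_mels=80, hidden=128, n_age_groups=4):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True)
        self.logits = nn.Linear(hidden, n_age_groups)
        self.project = nn.Linear(hidden + n_age_groups, n_mels)

    def forward(self, mel):                          # mel: (batch, frames, n_mels)
        hidden, _ = self.encoder(mel)
        pooled = hidden.mean(dim=1)                  # utterance-level summary
        age_scores = self.logits(pooled)             # per-age-group scores
        frames = hidden.size(1)
        scores = age_scores.unsqueeze(1).expand(-1, frames, -1)
        # Fold the age evidence back into frame-level features.
        return self.project(torch.cat([hidden, scores], dim=-1))
```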
According to one or more embodiments of the present disclosure, in a method for generating audio provided by the present disclosure, generating the target voice audio based on the acoustic feature information of the source speaker, the target age group tag information, and the target voice timbre information includes: generating the target voice audio based on the acoustic feature information of the source speaker, the age group tag information of the source speaker, the target age group tag information, and the target voice timbre information, wherein the age group tag information of the source speaker indicates the age group to which the acoustic feature information of the source speaker belongs.
According to one or more embodiments of the present disclosure, the present disclosure provides a method for generating audio in which the gender of the different users is the same as the gender of the source speaker.
According to one or more embodiments of the present disclosure, the present disclosure provides a method for generating audio in which the target voice timbre information is obtained by: acquiring voice audio of a person whose voice timbre is indicated by the target voice timbre information; and inputting the person's voice audio into a pre-trained voice timbre encoder to generate the target voice timbre information.
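A hedged sketch of applying such a pre-trained voice timbre encoder to a reference recording is shown below; the TimbreEncoder module, its dimensions, and the checkpoint name are assumptions, and a real system would substitute its own pre-trained speaker-embedding model.

```python
import torch
import torch.nn as nn

class TimbreEncoder(nn.Module):
    """Summarises a reference utterance's Mel features into a single timbre vector (illustrative)."""
    def __init__(self, n_mels=80, hidden=128, timbre_dim=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, timbre_dim)

    def forward(self, mel):                 # mel: (batch, frames, n_mels)
        _, last = self.rnn(mel)             # final hidden state: (1, batch, hidden)
        return self.proj(last.squeeze(0))   # (batch, timbre_dim)

# Usage sketch: run a recording of the person whose timbre is wanted through the
# pre-trained encoder, then reuse the resulting vector when decoding (see earlier sketch).
# timbre_encoder = TimbreEncoder()
# timbre_encoder.load_state_dict(torch.load("timbre_encoder.pt"))  # hypothetical checkpoint
# target_timbre = timbre_encoder(reference_mel)
```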
According to one or more embodiments of the present disclosure, the present disclosure provides a method for generating audio, in which the acoustic feature information is mel-frequency spectrum information.
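For context, Mel spectrum features of this kind can be extracted with a standard library such as librosa; the sampling rate and frame parameters below are common defaults rather than values taken from this disclosure.

```python
import librosa
import numpy as np

def extract_mel(path, sr=22050, n_mels=80, n_fft=1024, hop_length=256):
    """Loads a waveform and returns log-Mel spectrum frames of shape (frames, n_mels)."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    return np.log(mel + 1e-6).T
```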
In accordance with one or more embodiments of the present disclosure, there is provided an apparatus for generating audio, the apparatus comprising: an acquisition module configured to acquire acoustic feature information of a source speaker, target age group tag information, and target voice timbre information; and a generating module configured to generate target voice audio based on the acoustic feature information of the source speaker, the target age group tag information, and the target voice timbre information, wherein the target age group tag information indicates the age group to which the target voice audio belongs, and the timbre of the target voice audio matches the target voice timbre information.
In accordance with one or more embodiments of the present disclosure, in an apparatus for generating audio provided by the present disclosure, the generating module includes: an encoding unit configured to input the acoustic feature information of the source speaker into an encoder of a pre-trained generative adversarial network to obtain encoded acoustic feature information; a decoding unit configured to input the encoded acoustic feature information, the target age group tag information, and the target voice timbre information into a decoder of the pre-trained generative adversarial network to obtain target acoustic feature information; and an obtaining unit configured to input the target acoustic feature information into a vocoder to obtain the target voice audio.
According to one or more embodiments of the present disclosure, in an apparatus for generating audio provided by the present disclosure, the encoder and the decoder are trained as follows: acquiring acoustic feature information samples provided by different users, each annotated with acoustic feature information that carries only that user's voice timbre information; inputting the acoustic feature information samples into the encoder to be trained to obtain encoded acoustic feature information samples; inputting the encoded acoustic feature information samples into the decoder to obtain predicted acoustic feature information; training the encoder and the decoder of the generator network based on the predicted acoustic feature information, the acoustic feature information that corresponds to the input samples and carries the expected age group tag information, and the discriminator network of the generative adversarial network, to obtain a preliminarily trained encoder and decoder; and adjusting the parameters of the preliminarily trained encoder and decoder according to the deviation between the annotated acoustic feature information and the predicted acoustic feature information until the deviation satisfies a preset condition, thereby obtaining the trained encoder and decoder.
According to one or more embodiments of the present disclosure, in an apparatus for generating audio provided by the present disclosure, training the encoder and the decoder of the generator network based on the predicted acoustic feature information, the acoustic feature information that corresponds to the input samples and carries the expected age group tag information, and the discriminator network of the generative adversarial network, to obtain the preliminarily trained encoder and decoder, includes: inputting the predicted acoustic feature information into an age group information classifier to obtain classified acoustic feature information; and inputting the classified acoustic feature information and the acoustic feature information that corresponds to the input samples and carries the expected age group tag information into the discriminator network to train the encoder and the decoder of the generator network, thereby obtaining the preliminarily trained encoder and decoder.
According to one or more embodiments of the present disclosure, in an apparatus for generating audio provided by the present disclosure, the target voice audio is generated based on the acoustic feature information of the source speaker, the age group tag information of the source speaker, the target age group tag information, and the target voice timbre information, wherein the age group tag information of the source speaker indicates the age group to which the acoustic feature information of the source speaker belongs.
According to one or more embodiments of the present disclosure, the present disclosure provides an apparatus for generating audio in which the gender of the different users is the same as the gender of the source speaker.
According to one or more embodiments of the present disclosure, in an apparatus for generating audio provided by the present disclosure, the target voice timbre information is obtained by: acquiring voice audio of a person whose voice timbre is indicated by the target voice timbre information; and inputting the person's voice audio into a pre-trained voice timbre encoder to generate the target voice timbre information.
According to one or more embodiments of the present disclosure, in an apparatus for generating audio provided by the present disclosure, the acoustic feature information is mel-frequency spectrum information.
The modules described in the embodiments of the present disclosure may be implemented in software or hardware. The described modules may also be provided in a processor, which may be described as: a processor including an acquisition module and a generation module. In some cases, the names of these modules do not limit the modules themselves; for example, the acquisition module may also be described as a module for acquiring the acoustic feature information of the source speaker, the target age group tag information, and the target voice timbre information.
As another aspect, the present disclosure also provides a computer-readable medium, which may be included in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire acoustic feature information of a source speaker, target age group tag information, and target voice timbre information; and generate target voice audio based on the acoustic feature information of the source speaker, the target age group tag information, and the target voice timbre information, wherein the target age group tag information indicates the age group to which the target voice audio belongs, and the timbre of the target voice audio matches the target voice timbre information.
The foregoing description presents only preferred embodiments of the present disclosure and illustrates the technical principles employed. Those skilled in the art will appreciate that the scope of the disclosure is not limited to technical solutions formed by the particular combination of features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example, a technical solution formed by replacing the above features with (but not limited to) features disclosed in this disclosure that have similar functions.

Claims (11)

1. A method for generating audio, comprising:
acquiring acoustic feature information of a source speaker, target age group tag information and target voice timbre information;
and generating target voice audio based on the acoustic feature information of the source speaker, the target age group tag information and the target voice timbre information, wherein the target age group tag information indicates the age group to which the target voice audio belongs, and the timbre of the target voice audio matches the target voice timbre information.
2. The method of claim 1, wherein generating the target voice audio based on the acoustic feature information of the source speaker, the target age group tag information, and the target voice timbre information comprises:
inputting the acoustic feature information of the source speaker into an encoder of a pre-trained generative adversarial network to obtain encoded acoustic feature information;
inputting the encoded acoustic feature information, the target age group tag information and the target voice timbre information into a decoder of the pre-trained generative adversarial network to obtain target acoustic feature information;
and inputting the target acoustic feature information into a vocoder to obtain the target voice audio.
3. The method of claim 2, wherein the encoder and the decoder are trained by:
acquiring acoustic feature information samples provided by different users, each annotated with acoustic feature information that carries only that user's voice timbre information;
inputting the acoustic feature information samples into the encoder to be trained to obtain encoded acoustic feature information samples;
inputting the encoded acoustic feature information samples into the decoder to obtain predicted acoustic feature information;
training the encoder and the decoder of the generator network based on the predicted acoustic feature information, the acoustic feature information that corresponds to the input acoustic feature information samples and carries expected age group tag information, and a discriminator network of the generative adversarial network, to obtain a preliminarily trained encoder and decoder;
and adjusting parameters of the preliminarily trained encoder and decoder according to the deviation between the annotated acoustic feature information and the predicted acoustic feature information until the deviation satisfies a preset condition, thereby obtaining the trained encoder and decoder.
4. The method of claim 3, wherein training the encoder and the decoder of the generator network based on the predicted acoustic feature information, the acoustic feature information that corresponds to the input acoustic feature information samples and carries the expected age group tag information, and the discriminator network of the generative adversarial network, to obtain the preliminarily trained encoder and decoder, comprises:
inputting the predicted acoustic feature information into an age group information classifier to obtain classified acoustic feature information;
and inputting the classified acoustic feature information and the acoustic feature information that corresponds to the input acoustic feature information samples and carries the expected age group tag information into the discriminator network to train the encoder and the decoder of the generator network, thereby obtaining the preliminarily trained encoder and decoder.
5. The method of claim 1, wherein generating the target voice audio based on the acoustic feature information of the source speaker, the target age group tag information, and the target voice timbre information comprises:
generating the target voice audio based on the acoustic feature information of the source speaker, age group tag information of the source speaker, the target age group tag information and the target voice timbre information, wherein the age group tag information of the source speaker indicates the age group to which the acoustic feature information of the source speaker belongs.
6. The method of claim 3, wherein the gender of the different users is the same as the gender of the source speaker.
7. The method of claim 1, wherein the target voice timbre information is obtained by:
acquiring voice audio of a person whose voice timbre is indicated by the target voice timbre information;
and inputting the person's voice audio into a pre-trained voice timbre encoder to generate the target voice timbre information.
8. The method according to any one of claims 1-7, wherein the acoustic feature information is mel-frequency spectrum information.
9. An apparatus for generating audio, comprising:
an acquisition unit configured to acquire acoustic feature information of a source speaker, target age group tag information, and target voice timbre information;
a generating unit configured to generate target voice audio based on the acoustic feature information of the source speaker, the target age group tag information and the target voice timbre information, wherein the target age group tag information indicates the age group to which the target voice audio belongs, and the timbre of the target voice audio matches the target voice timbre information.
10. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon, wherein the one or more programs,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-8.
11. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-8.
CN202011272812.3A 2020-11-13 2020-11-13 Method, apparatus, device and medium for generating audio Pending CN112652292A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011272812.3A CN112652292A (en) 2020-11-13 2020-11-13 Method, apparatus, device and medium for generating audio

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011272812.3A CN112652292A (en) 2020-11-13 2020-11-13 Method, apparatus, device and medium for generating audio

Publications (1)

Publication Number Publication Date
CN112652292A true CN112652292A (en) 2021-04-13

Family

ID=75349245

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011272812.3A Pending CN112652292A (en) 2020-11-13 2020-11-13 Method, apparatus, device and medium for generating audio

Country Status (1)

Country Link
CN (1) CN112652292A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11735158B1 (en) * 2021-08-11 2023-08-22 Electronic Arts Inc. Voice aging using machine learning

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105096932A (en) * 2015-07-14 2015-11-25 百度在线网络技术(北京)有限公司 Voice synthesis method and apparatus of talking book
CN107564510A (en) * 2017-08-23 2018-01-09 百度在线网络技术(北京)有限公司 A kind of voice virtual role management method, device, server and storage medium
CN108461079A (en) * 2018-02-02 2018-08-28 福州大学 A kind of song synthetic method towards tone color conversion
CN109559736A (en) * 2018-12-05 2019-04-02 中国计量大学 A kind of film performer's automatic dubbing method based on confrontation network
WO2020104542A1 (en) * 2018-11-21 2020-05-28 Yoti Holding Limited Age estimation
CN111261177A (en) * 2020-01-19 2020-06-09 平安科技(深圳)有限公司 Voice conversion method, electronic device and computer readable storage medium
WO2020118521A1 (en) * 2018-12-11 2020-06-18 Microsoft Technology Licensing, Llc Multi-speaker neural text-to-speech synthesis
CN111369968A (en) * 2020-03-19 2020-07-03 北京字节跳动网络技术有限公司 Sound reproduction method, device, readable medium and electronic equipment
CN111583944A (en) * 2019-01-30 2020-08-25 北京搜狗科技发展有限公司 Sound changing method and device
CN111899720A (en) * 2020-07-30 2020-11-06 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for generating audio


Similar Documents

Publication Publication Date Title
CN111933110B (en) Video generation method, generation model training method, device, medium and equipment
CN111899720B (en) Method, apparatus, device and medium for generating audio
CN111402855B (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN111899719A (en) Method, apparatus, device and medium for generating audio
CN111402842B (en) Method, apparatus, device and medium for generating audio
CN111583900B (en) Song synthesis method and device, readable medium and electronic equipment
CN111798821B (en) Sound conversion method, device, readable storage medium and electronic equipment
CN112786006A (en) Speech synthesis method, synthesis model training method, apparatus, medium, and device
WO2022105861A1 (en) Method and apparatus for recognizing voice, electronic device and medium
CN112489606B (en) Melody generation method, device, readable medium and electronic equipment
CN113205793B (en) Audio generation method and device, storage medium and electronic equipment
CN111916053B (en) Voice generation method, device, equipment and computer readable medium
CN111883107B (en) Speech synthesis and feature extraction model training method, device, medium and equipment
CN112489621A (en) Speech synthesis method, device, readable medium and electronic equipment
CN113257283B (en) Audio signal processing method and device, electronic equipment and storage medium
CN111369968B (en) Speech synthesis method and device, readable medium and electronic equipment
CN112509562A (en) Method, apparatus, electronic device and medium for text post-processing
WO2022237665A1 (en) Speech synthesis method and apparatus, electronic device, and storage medium
CN112786013A (en) Voice synthesis method and device based on album, readable medium and electronic equipment
CN114429658A (en) Face key point information acquisition method, and method and device for generating face animation
CN112652292A (en) Method, apparatus, device and medium for generating audio
CN113823303A (en) Audio noise reduction method and device and computer readable storage medium
CN112382297A (en) Method, apparatus, device and medium for generating audio
CN112382273A (en) Method, apparatus, device and medium for generating audio
CN112382268A (en) Method, apparatus, device and medium for generating audio

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination