CN117546238A - Method, device and storage medium for generating audio - Google Patents


Info

Publication number
CN117546238A
Authority
CN
China
Prior art keywords
time domain
audio
domain data
audio time
original
Legal status
Pending
Application number
CN202280004612.0A
Other languages
Chinese (zh)
Inventor
张�浩
王凯
尹旭东
史润宇
Current Assignee
Beijing Xiaomi Mobile Software Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Application filed by Beijing Xiaomi Mobile Software Co Ltd
Publication of CN117546238A


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 — Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003 — Changing voice quality, e.g. pitch or formants

Landscapes

  • Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

The present disclosure relates to a method, apparatus and storage medium for generating audio. The method for generating audio comprises: acquiring original audio time-domain data; extracting timbre features of the original audio time-domain data to obtain original timbre features; and generating target-timbre audio time-domain data based on the original audio time-domain data, the original timbre features and target timbre features, wherein the semantic features of the target-timbre audio time-domain data match the semantic features of the original audio time-domain data, and the timbre features of the target-timbre audio time-domain data match the target timbre features. Because the conversion is performed directly on time-domain data, the information loss incurred when audio is converted from the time domain to the frequency domain is avoided, and the timbre-converted audio time-domain data sounds more realistic.

Description

Method, device and storage medium for generating audio
Technical Field
The present disclosure relates to the field of audio technologies, and in particular, to a method, an apparatus, and a storage medium for generating audio.
Background
Sound conversion technologies have a wide range of applications; audio timbre conversion is one such technology.
Audio timbre conversion is realized by a computer extracting timbre-independent semantic information and specific timbre features from some representation of the audio (a time series, a spectrum, etc.), and then recombining the semantic information with different timbre features.
In the related art, audio time-series data are converted into spectral data, the timbre is changed by changing the style of the spectral image, and the timbre-converted spectral data are finally converted back into time-series data. However, the audio time-domain data obtained after timbre conversion in this way may not sound realistic enough.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a method, apparatus, and storage medium for generating audio.
According to a first aspect of embodiments of the present disclosure, there is provided a method of generating audio, comprising:
acquiring original audio time-domain data; extracting timbre features of the original audio time-domain data to obtain original timbre features; and generating target-timbre audio time-domain data based on the original audio time-domain data, the original timbre features and target timbre features, wherein the semantic features of the target-timbre audio time-domain data match the semantic features of the original audio time-domain data, and the timbre features of the target-timbre audio time-domain data match the target timbre features.
In one embodiment, the generating the target-timbre audio time-domain data based on the original audio time-domain data, the original timbre features and the target timbre features includes: generating the target-timbre audio time-domain data based on the original audio time-domain data, the original timbre features, the target timbre features and a pre-trained audio generation network model; the audio generation network model is used for performing timbre conversion on audio time-domain data to generate timbre-converted audio time-domain data.
In one embodiment, the generating the target-timbre audio time-domain data based on the original audio time-domain data, the original timbre features, the target timbre features and a pre-trained audio generation network model includes: obtaining semantic features of the original audio time-domain data based on the original audio time-domain data and a semantic encoder included in the audio generation network model; and generating the target-timbre audio time-domain data based on the semantic features, the original timbre features, the target timbre features and a generator included in the audio generation network model; the generator is used for generating audio time-domain data based on semantic features and timbre features.
In one embodiment, the obtaining the semantic features of the original audio time-domain data based on the original audio time-domain data and the semantic encoder included in the audio generation network model includes: inputting the original audio time-domain data to the semantic encoder included in the audio generation network model, and inputting the semantic features output by the semantic encoder to a timbre class classifier included in the audio generation network model, the timbre class classifier being used to identify the timbre class of the input semantic features; and constraining the semantic encoder based on the output of the timbre class classifier so that the semantic features output by the semantic encoder do not contain timbre features, thereby obtaining the semantic features of the original audio time-domain data.
In one embodiment, the generator is trained in the following manner: inputting first audio time-domain data to the semantic encoder to obtain first audio semantic features, and inputting the first audio semantic features and a target timbre feature to a prediction generator to obtain target-timbre audio time-domain prediction data; inputting the target-timbre audio time-domain prediction data to the semantic encoder to obtain semantic features of the target-timbre audio time-domain prediction data, and inputting these semantic features and the timbre features of the first audio to the prediction generator to obtain second audio time-domain prediction data; determining a real/fake adversarial loss and a timbre feature regression loss by a preset discriminator based on the first audio time-domain data and the corresponding timbre features; determining a real/fake adversarial loss and a timbre feature regression loss by the discriminator based on the target-timbre audio time-domain prediction data and the target timbre features; determining a reconstruction loss based on the first audio time-domain data and the second audio time-domain prediction data; and constraining the training of the prediction generator based on the real/fake adversarial losses, the timbre feature regression losses and the reconstruction loss to obtain a generator satisfying the constraint conditions.
According to a second aspect of embodiments of the present disclosure, there is provided an apparatus for generating audio, comprising:
an acquisition unit, configured to acquire original audio time-domain data; an extraction unit, configured to extract timbre features of the original audio time-domain data to obtain original timbre features; and a generating unit, configured to generate target-timbre audio time-domain data based on the original audio time-domain data, the original timbre features and target timbre features, wherein the semantic features of the target-timbre audio time-domain data match the semantic features of the original audio time-domain data, and the timbre features of the target-timbre audio time-domain data match the target timbre features.
In one embodiment, the generating unit generates the target-timbre audio time-domain data based on the original audio time-domain data, the original timbre features and the target timbre features in the following manner: generating the target-timbre audio time-domain data based on the original audio time-domain data, the original timbre features, the target timbre features and a pre-trained audio generation network model; the audio generation network model is used for performing timbre conversion on audio time-domain data to generate timbre-converted audio time-domain data.
In one embodiment, the generating unit generates the target-timbre audio time-domain data based on the original audio time-domain data, the original timbre features, the target timbre features and the pre-trained audio generation network model in the following manner: obtaining semantic features of the original audio time-domain data based on the original audio time-domain data and a semantic encoder included in the audio generation network model; and generating the target-timbre audio time-domain data based on the semantic features, the original timbre features, the target timbre features and a generator included in the audio generation network model; the generator is used for generating audio time-domain data of a corresponding timbre based on semantic features and timbre features.
In one embodiment, the generating unit obtains the semantic features of the original audio time-domain data based on the original audio time-domain data and the semantic encoder included in the audio generation network model in the following manner: inputting the original audio time-domain data to the semantic encoder included in the audio generation network model, and inputting the semantic features output by the semantic encoder to a timbre class classifier included in the audio generation network model, the timbre class classifier being used to identify the class of timbre features; and constraining the semantic encoder based on the output of the timbre class classifier so that the semantic features output by the semantic encoder do not include timbre features, thereby obtaining the semantic features of the original audio time-domain data.
In one embodiment, the generator is pre-trained in the following manner: inputting first audio time-domain data to the semantic encoder to obtain first audio semantic features, and inputting the first audio semantic features and a target timbre feature to a prediction generator to obtain target-timbre audio time-domain prediction data; inputting the target-timbre audio time-domain prediction data to the semantic encoder to obtain semantic features of the target-timbre audio time-domain prediction data, and inputting these semantic features and the timbre features of the first audio to the prediction generator to obtain second audio time-domain prediction data; determining a real/fake adversarial loss and a timbre feature regression loss by a preset discriminator based on the first audio time-domain data and the corresponding timbre features; determining a real/fake adversarial loss and a timbre feature regression loss by the discriminator based on the target-timbre audio time-domain prediction data and the target timbre features; determining a reconstruction loss based on the first audio time-domain data and the second audio time-domain prediction data; and constraining the training of the prediction generator based on the real/fake adversarial losses, the timbre feature regression losses and the reconstruction loss to obtain a generator satisfying the constraint conditions.
According to a third aspect of embodiments of the present disclosure, there is provided an apparatus for generating audio, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the method of the first aspect or any implementation of the first aspect.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer readable storage medium having instructions stored therein, which when executed by a processor of a terminal, enable the terminal to perform the method of the first aspect or any one of the embodiments of the first aspect.
The technical solution provided by the embodiments of the present disclosure may have the following beneficial effects: timbre features are extracted from the acquired original audio time-domain data to obtain original timbre features, and target-timbre audio time-domain data are then generated based on the original audio time-domain data, the original timbre features and the target timbre features. Because the conversion operates directly on time-domain data, the information loss incurred when audio is converted from the time domain to the frequency domain is avoided, and the timbre-converted audio time-domain data sounds more realistic.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a flowchart illustrating a method of generating audio according to an exemplary embodiment.
Fig. 2 is a flow chart illustrating a method of generating target timbre audio time domain data according to an example embodiment.
Fig. 3 is a flow chart illustrating a method of generating target timbre audio time domain data according to an exemplary embodiment.
FIG. 4 is a flow chart illustrating one way of deriving semantic features of original audio time domain data according to an example embodiment.
FIG. 5 is a flowchart illustrating a pre-training generator according to an exemplary embodiment.
Fig. 6 shows a schematic diagram of timbre converted audio generation.
Fig. 7 is a block diagram illustrating an apparatus for generating audio according to an exemplary embodiment.
Fig. 8 is a block diagram illustrating an apparatus for generating audio according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure.
In the drawings, the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The described embodiments are some, but not all, embodiments of the present disclosure. The embodiments described below by referring to the drawings are exemplary and intended for the purpose of explaining the present disclosure and are not to be construed as limiting the present disclosure. Based on the embodiments in this disclosure, all other embodiments that a person of ordinary skill in the art would obtain without making any inventive effort are within the scope of protection of this disclosure. Embodiments of the present disclosure are described in detail below with reference to the attached drawings.
The method for generating audio provided by the present disclosure can be applied to fields such as audio synthesis, human-computer interaction and virtual reality, and in particular to scenarios such as music creation, in-game voice changing, audiobook narration and live streaming. For example, a musician may hit upon a melody while composing and later forget it; with the technology of the present disclosure, a hummed melody can be rendered with different instruments, the best-sounding arrangement can be determined, and the melody can be adjusted, so that a highly expressive musical work is created. Second, the socialization of games has been an important trend in the game industry in recent years; adding a voice-changing option for players makes in-game voice interaction more entertaining, and improving the social attributes of a game increases user retention. Further, combined with sound conversion technology, a user can choose to have the stories in a book read aloud in the voice of a family member, and a child can choose to have a favorite story read in the voice of a favorite cartoon character. In addition, a live streamer can change timbre while keeping his or her language style, selecting voices with different timbres for different business scenarios; this not only adds fun but also lets the streamer speak with a target timbre, increasing the appeal of the stream.
Sound conversion technologies have a wide range of applications; audio timbre conversion is one such technology. Audio timbre conversion is realized by a computer extracting timbre-independent semantic information and specific timbre features from some representation of the audio (a time series, a spectrum, etc.), and then recombining the semantic information with different timbre features.
In the related art, the CQT spectrum of the input audio is computed through a CQT transform, the CQT spectrum is then converted by a Cycle-GAN network into the CQT spectrum of audio with the target-domain timbre, thereby realizing conversion of the audio CQT spectrum, and the timbre-converted CQT spectrum is finally converted into time-domain audio by a pre-trained WaveNet network model, generating the timbre-converted target-style audio. This audio timbre conversion method has the following two problems. First, the time-domain data generated from the timbre-converted spectrum are not realistic enough: the time span of audio time-domain data is large (one second of audio can contain up to 11052 sampling points), and converting the audio data directly into a spectrum to realize timbre conversion easily loses part of the audio information, so that the timbre-converted audio differs semantically from the input audio and may even contain a large amount of noise. Moreover, once the audio is converted from time-series data into spectral form, the spectral envelopes of different timbres do not follow the same peak pattern at different pitches and different overtones and harmonic frequencies must be handled, which makes extracting timbre features and semantic features from the spectral image very difficult. Second, a trained model cannot convert an audio file to multiple styles of timbre: because Cycle-GAN is used to convert the CQT spectrum timbre, a trained model can only convert one timbre into the CQT spectrum of another timbre, and if the input spectrum is to be converted into N timbres, N different Cycle-GAN models must be trained, resulting in a large workload.
In view of this, the present disclosure provides a method for generating audio, in which timbre features are extracted from the acquired original audio time-domain data to obtain original timbre features, and target-timbre audio time-domain data are then generated based on the original audio time-domain data, the original timbre features and target timbre features. Because the conversion operates directly on time-domain data, the information loss incurred when audio is converted from the time domain to the frequency domain is avoided, and the timbre-converted audio time-domain data sounds more realistic. The method of generating audio provided by the present disclosure is therefore more flexible and realistic than the audio timbre conversion methods in the related art.
Fig. 1 is a flowchart illustrating a method of generating audio according to an exemplary embodiment. As shown in Fig. 1, the method is used in a terminal and includes the following steps.
In step S11, original audio time domain data is acquired.
In step S12, tone characteristics of the original audio time domain data are extracted, and the original tone characteristics are obtained.
Here, timbre refers to the characteristic that different sounds always differ in waveform, because different objects vibrate in different ways.
In the embodiment of the present disclosure, Mel-frequency cepstral coefficients (Mel-Frequency Cepstral Coefficients, MFCC) may be used to extract the timbre features of the original audio time-domain data to obtain the original timbre feature t_1. It should be understood that the embodiments of the present disclosure do not specifically limit how the timbre features are extracted from the audio.
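As an illustration only, the following sketch shows one way such an MFCC-based timbre feature could be computed, assuming the librosa library is available; the time-averaging into a single vector and the 128-dimensional size are assumptions chosen to match the feature length mentioned later in this description, not details taken from the disclosure.

```python
# Illustrative sketch: extracting an MFCC-based timbre feature vector from
# time-domain audio. librosa is assumed; the pooling and feature size are
# illustrative choices, not the patent's exact procedure.
import librosa
import numpy as np

def extract_timbre_feature(wav_path: str, n_mfcc: int = 128) -> np.ndarray:
    audio, sr = librosa.load(wav_path, sr=16000)                # original audio time-domain data
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    return mfcc.mean(axis=1)                                    # time-averaged timbre feature t_1

t_1 = extract_timbre_feature("speaker_a.wav")
print(t_1.shape)  # (128,)
```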
In step S13, target timbre audio time domain data is generated based on the original audio time domain data, the original timbre characteristics, and the target timbre characteristics.
Here, the semantic features of the target-timbre audio time-domain data match the semantic features of the original audio time-domain data, and the timbre features of the target-timbre audio time-domain data match the target timbre feature t_n.
In the present disclosure, original audio time-domain data are acquired, timbre features of the original audio time-domain data are extracted to obtain the original timbre features, and target-timbre audio time-domain data are generated based on the original audio time-domain data, the original timbre features and the target timbre features. In this way the timbre-converted audio preserves what the user says in the original audio time-domain data while accurately converting the original timbre into the set target timbre, avoiding the information loss of the timbre conversion process.
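The three steps can be read as a simple pipeline. The sketch below is illustrative only; the encoder and generator callables stand in for the pre-trained semantic encoder and generator described in the embodiments that follow, and their names and signatures are assumptions.

```python
# Illustrative sketch of how steps S11-S13 could be chained. `encoder` and
# `generator` are placeholders for the pre-trained components described later.
import librosa

def convert_timbre(x_a_t1, t_n, encoder, generator, sr=16000):
    # S12: extract the original timbre feature t_1 (MFCC-based, as sketched above).
    t_1 = librosa.feature.mfcc(y=x_a_t1, sr=sr, n_mfcc=128).mean(axis=1)
    # The pre-trained semantic encoder strips timbre, leaving semantic features S_a;
    # t_1 is what the model is trained to remove, t_n is what the generator adds back.
    s_a = encoder(x_a_t1)
    # S13: generate target-timbre audio time-domain data X_a_tn.
    return generator(s_a, t_n)
```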
In the following disclosed embodiments, a process of generating target tone color audio time domain data will be described in detail.
Fig. 2 is a flowchart illustrating a method of generating target timbre audio time-domain data according to an exemplary embodiment, as shown in fig. 2, based on original audio time-domain data, original timbre characteristics, and target timbre characteristics, the method including the steps of.
In step S21, the original timbre features of the original audio time-domain data are acquired, and the target timbre features are determined.
In an embodiment of the present disclosure, the original timbre feature t_1 of the original audio time-domain data is acquired and the target timbre feature t_n is determined. Determining the target timbre feature is essentially the process of building a target-timbre feature dataset: audio time-domain data of some speakers are selected from a dataset containing different speakers and are classified by timbre according to the identity information of the different speakers, yielding audio time-domain datasets with different timbres. In the related art, timbre features and semantic features are extracted from audio time-domain data and a generator then produces timbre-converted audio time-domain data; however, representing timbre features is difficult. Although some researchers classify speakers on spectrograms and take the features before the classification layer as timbre features, this approach is too coarse: the amount of information in a spectrum differs from that in time-domain data, so the extracted timbre features are neither comprehensive nor accurate. In the embodiment of the present disclosure, a WaveNet network model is trained to classify the speakers of the audio files in the audio time-domain dataset. When the WaveNet network model can accurately predict the speaker identity corresponding to the input audio, the features the model extracts from an audio file contain speaker-specific information; since the differences between speakers are mainly timbre differences, the features obtained at this point are timbre-related features, that is, timbre features.
In step S22, the target timbre audio time domain data is generated based on the original audio time domain data, the original timbre features, and the target timbre features, and the pre-trained audio generation network model.
The audio generation network model is used for performing tone color conversion on the audio time domain data to generate tone color converted audio time domain data.
In an embodiment of the present disclosure, the audio generation network model includes a semantic encoder, a timbre class classifier, a generator and a discriminator. Given a target timbre feature t_n and the semantic features S_a extracted from the audio time-domain data X_a_t1, the network can generate audio time-domain data X_a_tn whose timbre is consistent with the target timbre feature t_n and whose content is consistent with X_a_t1. Here X_a_t1 and X_a_tn are audio time-series data of length 96000, and the semantic feature S_a and the timbre feature t_n are feature vectors of length 128, with X_a_tn = G(X_a_t1 | t_n). t_1 and t_n denote the timbre feature of the input audio and the timbre feature of the audio to be generated, respectively.
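For illustration, the sketch below mirrors only the interfaces and tensor sizes stated above (96000-sample audio, 128-dimensional semantic and timbre feature vectors, X_a_tn = G(X_a_t1 | t_n)), assuming PyTorch; the internal layer choices are placeholders and are not the architecture of this disclosure.

```python
# Illustrative placeholder modules for the four components described in the text.
# Only the shapes and call signatures are taken from the description.
import torch
import torch.nn as nn

AUDIO_LEN, FEAT_DIM = 96000, 128

class SemanticEncoder(nn.Module):            # E: X_a_t1 -> S_a
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv1d(1, 64, 9, stride=750, padding=4), nn.ReLU(),
                                 nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                                 nn.Linear(64, FEAT_DIM))
    def forward(self, x):                     # x: (batch, 1, 96000)
        return self.net(x)                    # S_a: (batch, 128)

class Generator(nn.Module):                   # G: (S_a, timbre feature) -> audio
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * FEAT_DIM, 256), nn.ReLU(),
                                 nn.Linear(256, AUDIO_LEN), nn.Tanh())
    def forward(self, s_a, t):
        return self.net(torch.cat([s_a, t], dim=-1))   # (batch, 96000)

class TimbreClassifier(nn.Module):            # C: S_a -> speaker/timbre class logits
    def __init__(self, n_classes=10):
        super().__init__()
        self.net = nn.Linear(FEAT_DIM, n_classes)
    def forward(self, s_a):
        return self.net(s_a)

class Discriminator(nn.Module):               # D: audio -> (real/fake prob, predicted timbre)
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(AUDIO_LEN, 256), nn.ReLU())
        self.real_fake = nn.Sequential(nn.Linear(256, 1), nn.Sigmoid())
        self.timbre_head = nn.Linear(256, FEAT_DIM)
    def forward(self, x):
        h = self.backbone(x)
        return self.real_fake(h), self.timbre_head(h)
```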
In the present disclosure, the original timbre features of the original audio time-domain data are acquired, the target timbre features are determined, and the target-timbre audio time-domain data are generated based on the original audio time-domain data, the original timbre features, the target timbre features and the pre-trained audio generation network model. Because a single pre-trained audio generation network model is used, accurate target-timbre audio time-domain data can be obtained for whatever target timbre features the user presets, audio conversion to multiple timbres can be achieved, and the network model does not need to be retrained for each different timbre conversion. For example, in a specific application scenario, the normal audio of a speaker can be converted into audio with the "Raeli" timbre selected by the user and played back, which adds fun.
FIG. 3 is a flowchart illustrating a method of generating target timbre audio time domain data, as shown in FIG. 3, based on original audio time domain data, original timbre features, and target timbre features, and a pre-trained audio generation network model, the method comprising the steps of.
In step S31, semantic features of the original audio time domain data are obtained based on the original audio time domain data and a semantic encoder included in the audio generation network model.
In the disclosed embodiment, the semantic encoder employs a WaveNet network model. WaveNet is a sequence generation model that can be used to model speech generation; in acoustic modeling for speech synthesis, WaveNet directly learns a mapping over sequences of sample values and therefore achieves good synthesis quality. The semantic encoder is trained to extract the semantic features S_a from audio time-domain data X_a_t1 whose timbre feature is t_1, i.e., S_a = E(X_a_t1). In other words, the pre-trained semantic encoder yields features that contain only the semantic features of the original audio time-domain data and none of its original timbre features.
In step S32, target timbre audio time domain data is generated based on the semantic features, the original timbre features, the target timbre features, and the generators included in the audio generation network model.
Wherein the generator is for generating audio time domain data based on the semantic features and the audio features. The generator also employs a WaveNet network model.
In the embodiment of the present disclosure, the generator realizes the mapping from the original audio time-domain data to reconstructed original audio time-domain data, the mapping X_a_t1 → X_a_tn from the original audio time-domain data to the target-timbre audio time-domain data, and the mapping from the target-timbre audio time-domain data back to reconstructed original audio time-domain data.
In the present disclosure, the semantic features of the original audio time-domain data are obtained based on the original audio time-domain data and the semantic encoder included in the audio generation network model, and the target-timbre audio time-domain data are generated based on the semantic features, the original timbre features, the target timbre features and the generator included in the audio generation network model. In this way, clean semantic features of the original audio time-domain data are obtained, and the mappings from the original audio time-domain data to reconstructed original audio time-domain data, from the original audio time-domain data to the target-timbre audio time-domain data (X_a_t1 → X_a_tn), and from the target-timbre audio time-domain data back to reconstructed original audio time-domain data are realized.
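As an illustration, the following sketch expresses the three mappings with the placeholder modules defined in the earlier sketch; the variable names are assumptions.

```python
# Illustrative sketch of the three mappings described above, reusing the
# placeholder SemanticEncoder E and Generator G sketched earlier.
E, G = SemanticEncoder(), Generator()
x_a_t1 = torch.randn(1, 1, AUDIO_LEN)          # original audio time-domain data
t_1 = torch.randn(1, FEAT_DIM)                 # original timbre feature
t_n = torch.randn(1, FEAT_DIM)                 # target timbre feature

s_a = E(x_a_t1)                                # semantic features S_a
x_rec = G(s_a, t_1)                            # X_a_t1 -> reconstructed X_a_t1
x_a_tn = G(s_a, t_n)                           # X_a_t1 -> X_a_tn (timbre conversion)
x_cycle = G(E(x_a_tn.unsqueeze(1)), t_1)       # X_a_tn -> reconstructed X_a_t1 (cycle back)
```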
FIG. 4 is a flowchart illustrating a method of deriving semantic features of original audio time-domain data, as shown in FIG. 4, based on the original audio time-domain data and a semantic encoder included in an audio generation network model, according to an exemplary embodiment, comprising the following steps.
In step S41, the original audio time domain data is input to a semantic encoder included in the audio generation network model, and the semantic features output by the semantic encoder are input to a tone color class classifier included in the audio generation network model.
The timbre class classifier is used to identify the class of timbre features, for example guitar, piano or violin. The timbre class classifier judges, from the timbre feature t_1 contained in the semantic features S_a, the identity information of the person corresponding to the audio file, and is used to constrain the semantic encoder so that the extracted semantic features S_a contain as little of the timbre feature t_1 as possible.
In step S42, the semantic encoder is constrained based on the output of the timbre class classifier, so that the semantic features output by the semantic encoder no longer reveal the class corresponding to the original timbre features, and the semantic features of the original audio time-domain data are obtained.
In the disclosed embodiment, adversarial training is performed between the semantic encoder and the timbre class classifier. The timbre class classifier is trained first: the original audio time-domain data X_a_t1 and the class I_c of the audio file are input to the untrained timbre class classifier, a loss value is calculated from the domain adversarial loss function L_cls, and the timbre class classifier is optimized by minimizing the loss value until it converges, yielding a trained timbre class classifier. The semantic encoder is then trained with the network weight parameters of the timbre class classifier fixed: the original audio time-domain data X_a_t1 and the class I_c of the audio file are input to the untrained semantic encoder to obtain semantic features that still contain timbre features; these semantic features are input to the trained timbre class classifier, a loss value is calculated from the domain adversarial loss function L_cls, and the semantic encoder is optimized by maximizing the loss value until it converges, yielding a trained semantic encoder.
In the disclosed embodiment, the domain adversarial loss function L_cls (a classification loss over the speaker identity predicted from S_a) is used to compute the loss value when training the semantic encoder and the timbre class classifier, so as to ensure that the semantic information features the semantic encoder extracts from the audio time-series data are independent of the timbre information features. The goal of the timbre class classifier is to accurately judge, from the semantic features S_a, the identity of the person the audio belongs to, which forces it to extract the timbre information from the audio time-domain data; the goal of training the timbre class classifier is therefore to minimize the loss function L_cls. The semantic encoder, by contrast, wants the extracted semantic features S_a to contain no timbre information features, so the goal of training the semantic encoder is to maximize the loss function L_cls.
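The following sketch illustrates these alternating updates, assuming PyTorch and reusing the placeholder modules from the earlier sketches; using cross-entropy for L_cls is an assumption, since the exact form of the loss is not given here.

```python
# Illustrative sketch of the adversarial training between the semantic encoder E
# and the timbre class classifier C (both placeholders from the sketches above).
import torch
import torch.nn.functional as F

C = TimbreClassifier(n_classes=10)
opt_c = torch.optim.Adam(C.parameters(), lr=1e-4)
opt_e = torch.optim.Adam(E.parameters(), lr=1e-4)

def train_classifier_step(x_a_t1, speaker_ids):
    # Classifier: minimize L_cls so it recovers the speaker identity from S_a.
    loss = F.cross_entropy(C(E(x_a_t1).detach()), speaker_ids)
    opt_c.zero_grad(); loss.backward(); opt_c.step()
    return loss.item()

def train_encoder_step(x_a_t1, speaker_ids):
    # Encoder: maximize L_cls (minimize its negative) so S_a hides the timbre class.
    loss = -F.cross_entropy(C(E(x_a_t1)), speaker_ids)
    opt_e.zero_grad(); loss.backward(); opt_e.step()
    return -loss.item()
```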
In the present disclosure, the original audio time-domain data are input to the semantic encoder included in the audio generation network model, the semantic features output by the semantic encoder are input to the timbre class classifier included in the audio generation network model, and the semantic encoder is constrained based on the output of the timbre class classifier so that the semantic features it outputs no longer carry the class of the original timbre features, yielding the semantic features of the original audio time-domain data. Through the adversarial training between the timbre class classifier and the semantic encoder, the semantic encoder can finally extract the corresponding semantic features from the input audio.
FIG. 5 is a flow chart illustrating a pre-training generator, as shown in FIG. 5, according to an exemplary embodiment, the generator is pre-trained in the following manner, including the following steps.
In step S51, the first audio time domain data is input to a semantic encoder to obtain first audio semantic features, and the first audio semantic features and the target timbre features are input to a prediction generator to obtain target timbre audio time domain prediction data.
In step S52, the target timbre audio time-domain prediction data is input to the semantic encoder, semantic features of the target timbre audio time-domain prediction data are obtained, and the semantic features and the timbre features of the first audio are input to the prediction generator, so as to obtain the second audio time-domain prediction data.
In the present disclosure, a real/fake adversarial loss and a timbre feature regression loss are determined by a preset discriminator based on the first audio time-domain data and the corresponding timbre features; a real/fake adversarial loss and a timbre feature regression loss are determined by the discriminator based on the target-timbre audio time-domain prediction data and the target timbre features; a reconstruction loss is determined based on the first audio time-domain data and the second audio time-domain prediction data; and the training of the prediction generator is constrained based on the real/fake adversarial losses, the timbre feature regression losses and the reconstruction loss to obtain a generator satisfying the constraint conditions.
In the disclosed embodiment, adversarial training is performed between the generator and the discriminator. The discriminator is trained first; each training step has two groups of inputs: the first group is the original audio time-domain data X_a_t1 and the original timbre feature t_1, and the second group is the timbre-converted target-timbre audio time-domain data X_a_tn generated by the generator from the input original audio time-domain data X_a_t1 and the target timbre feature t_n, together with the target timbre feature t_n. The output of the discriminator for each group of inputs is a predicted timbre feature vector and a real/fake probability value; the network parameters of the discriminator are optimized according to the real/fake adversarial loss and the timbre feature regression loss until the discriminator converges, yielding a trained discriminator. The generator is then trained; each training step has three groups of inputs: the first group is the original audio time-domain data X_a_t1 and the target timbre feature t_n, which outputs the target-timbre audio time-domain data X_a_tn; X_a_tn is input to the discriminator, which predicts a real/fake probability value and a timbre feature. The second group is the original audio time-domain data X_a_t1 and the original timbre feature t_1, which outputs reconstructed original audio time-domain data used to compute the second term of the reconstruction loss L_rec. The third group is the target-timbre audio time-domain data X_a_tn and the original timbre feature t_1, which outputs reconstructed original audio time-domain data used to compute the first term of the reconstruction loss L_rec. The loss values are then calculated from the loss functions, and the network parameters of the generator are optimized and updated by back propagation.
In the disclosed embodiments, the timbre feature regression loss function, the real/fake adversarial loss function and the reconstruction loss function are used as loss functions when training the discriminator and the generator. The timbre feature regression loss matches the timbre attribute of the audio generated by the generator to the given timbre features; it can be written as L_t = E[||D_t(X_a_t1) − t_1||_2^2] + E[||D_t(G(X_a_t1 | t_n)) − t_n||_2^2], where D_t(·) denotes the timbre feature predicted by the discriminator. The first term of L_t is an L2 regression loss between the timbre feature the discriminator predicts for the input original audio time-domain data X_a_t1 and the original timbre feature t_1 of that data, which improves the discriminator's ability to predict timbre features. The second term is an L2 regression loss between the timbre feature predicted for the generated audio time-domain data G(X_a_t1 | t_n) and the given desired target timbre feature t_n, forcing the timbre feature of the audio time-domain data generated by the generator to be as consistent as possible with the given target timbre feature.
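For illustration, the timbre feature regression loss could be computed as below, reusing the placeholder two-headed discriminator from the earlier sketch and assuming mean-squared error for the L2 regression.

```python
# Illustrative sketch of the timbre feature regression loss L_t described above.
import torch.nn.functional as F

D = Discriminator()   # placeholder discriminator from the earlier sketch

def timbre_regression_loss(x_real, t_1, x_fake, t_n):
    _, pred_t1 = D(x_real.view(x_real.size(0), -1))   # timbre predicted for real audio
    _, pred_tn = D(x_fake.view(x_fake.size(0), -1))   # timbre predicted for generated audio
    return F.mse_loss(pred_t1, t_1) + F.mse_loss(pred_tn, t_n)
```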
In the disclosed embodiment, the real/fake adversarial loss function pushes the audio time-domain data generated by the generator towards the distribution of real audio time-domain data. It is expressed as L_I(G, D, X_a_t1, t_n) = E[log(D(X_a_t1))] + E[log(1 − D(G(X_a_t1 | t_n)))], where for input real audio time-domain data the discriminator is expected to output a real/fake prediction close to 1, and for audio time-domain data generated by the generator the discriminator is expected to output a real/fake prediction close to 0. During initial training the discriminator outputs a probability close to 0 for the audio time-domain data generated by the generator, because the generator is still weak; for audio generated by the trained generator, the discriminator is expected to output a real/fake probability close to 1. The generator wants the generated audio time-domain data to be as similar as possible to real audio time-domain data, i.e., it wants the discriminator to output a probability close to 1 for the audio files it generates; the final goal of the generator is therefore to minimize the real/fake adversarial loss function, while the goal of the discriminator is to maximize it.
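A minimal sketch of this adversarial loss, in the log form written above, assuming PyTorch and the placeholder discriminator from the earlier sketches:

```python
# Illustrative sketch of L_I = E[log D(X_a_t1)] + E[log(1 - D(G(X_a_t1 | t_n)))].
# The discriminator maximizes this value, the generator minimizes it, as stated above.
import torch

def adversarial_loss(x_real_flat, x_fake_flat, eps=1e-7):
    d_real, _ = D(x_real_flat)
    d_fake, _ = D(x_fake_flat)
    return torch.log(d_real + eps).mean() + torch.log(1 - d_fake + eps).mean()
```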
In the disclosed embodiment, the reconstruction loss function makes the audio time-domain data generated by the generator consistent with the input audio time-domain data in terms of semantic information, with only the timbre features changed. It can be written as L_rec = E[||G(G(X_a_t1 | t_n) | t_1) − X_a_t1||_2^2] + λ_1 E[||G(X_a_t1 | t_1) − X_a_t1||_2^2]. Here G(G(X_a_t1 | t_n) | t_1) denotes that the generator first generates the timbre-converted target-timbre audio time-domain data X_a_tn from the semantic features S_a of the input original audio time-domain data X_a_t1 and the desired target timbre feature t_n, the encoder then extracts the semantic features of X_a_tn, and these are input to the generator together with the original timbre feature t_1 to output reconstructed audio time-domain data. In addition, the generator also reconstructs the input audio time-domain data from the semantic features S_a of the input original audio time-domain data X_a_t1 and the corresponding timbre feature t_1. If the generator is able to keep the semantic information unchanged while changing the timbre attribute, the reconstructed audio time-domain data should be very similar to the input audio time-domain data; therefore the difference between the reconstructed audio time-domain data and the input audio time-domain data is computed with an L2 regression loss, forcing the generator to keep the semantic features unchanged when generating audio time-domain data. λ_1 is a hyperparameter used to weigh the importance of the reconstruction loss at different stages.
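For illustration, the reconstruction loss could be computed as below with the placeholder encoder and generator from the earlier sketches; placing λ_1 on the self-reconstruction term is an assumption, since the text only says λ_1 weighs the reconstruction loss.

```python
# Illustrative sketch of the reconstruction loss L_rec described above.
import torch.nn.functional as F

def reconstruction_loss(x_a_t1, t_1, t_n, lambda_1=1.0):
    s_a = E(x_a_t1)
    x_a_tn = G(s_a, t_n)                          # timbre-converted audio
    x_cycle = G(E(x_a_tn.unsqueeze(1)), t_1)      # cycle back to the original timbre
    x_self = G(s_a, t_1)                           # self-reconstruction
    target = x_a_t1.view(x_a_t1.size(0), -1)
    return F.mse_loss(x_cycle, target) + lambda_1 * F.mse_loss(x_self, target)
```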
In the embodiment of the present disclosure, the total loss function of the timbre-converted audio generation method comprises the domain adversarial loss, the timbre feature regression loss, the reconstruction loss and the real/fake adversarial loss, with different weights for the different losses; the final loss function can be written as L = λ_cls L_cls + λ_t L_t + λ_rec L_rec + λ_I L_I, where λ_cls, λ_t, λ_rec and λ_I are hyperparameters that control the relative importance of each loss. Finally, the training of the entire network can be defined as the standard minimax problem of generative adversarial networks: the network weight parameters of the generator side are optimized with the goal of minimizing the loss value of the loss function L, while the network weight parameters of the discriminator and the timbre class classifier are optimized with the goal of maximizing the loss value of the loss function L, yielding the timbre-converting audio generation network.
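The sketch below combines the losses above into one alternating training step, reusing the earlier placeholder modules and loss sketches. The loss weights are placeholders, and the update directions follow the per-loss goals described in this section (the timbre class classifier minimizes L_cls while the semantic encoder maximizes it; the discriminator minimizes L_t and maximizes L_I; the generator side minimizes L_t, L_I and L_rec).

```python
# Illustrative sketch of one alternating update of the combined objective.
import torch
import torch.nn.functional as F

lam_cls, lam_t, lam_rec, lam_I = 1.0, 1.0, 10.0, 1.0   # placeholder weights
opt_g = torch.optim.Adam(list(E.parameters()) + list(G.parameters()), lr=1e-4)
opt_d = torch.optim.Adam(list(D.parameters()) + list(C.parameters()), lr=1e-4)

def training_step(x_a_t1, t_1, t_n, speaker_ids):
    flat = x_a_t1.view(x_a_t1.size(0), -1)

    # --- discriminator / timbre-class-classifier step ---
    with torch.no_grad():
        x_fake = G(E(x_a_t1), t_n)
    L_cls = F.cross_entropy(C(E(x_a_t1).detach()), speaker_ids)   # classifier: minimize
    L_t   = timbre_regression_loss(flat, t_1, x_fake, t_n)        # discriminator: minimize
    L_adv = -adversarial_loss(flat, x_fake)                       # discriminator: maximize L_I
    d_loss = lam_cls * L_cls + lam_t * L_t + lam_I * L_adv
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- semantic-encoder / generator step ---
    s_a = E(x_a_t1)
    x_fake = G(s_a, t_n)
    L_cls = -F.cross_entropy(C(s_a), speaker_ids)                 # encoder: maximize L_cls
    _, pred_tn = D(x_fake)
    L_t_g = F.mse_loss(pred_tn, t_n)                              # generator: match target timbre
    L_I   = adversarial_loss(flat, x_fake)                        # generator: minimize L_I
    L_rec = reconstruction_loss(x_a_t1, t_1, t_n)
    g_loss = lam_cls * L_cls + lam_t * L_t_g + lam_I * L_I + lam_rec * L_rec
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```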
In the present disclosure, the original semantic training features and the original audio training features of the audio time-domain training data are input to the prediction generator to obtain first audio time-domain prediction data; the original semantic training features and the target audio training features of the audio time-domain training data are input to the prediction generator to obtain second audio time-domain prediction data; and the predicted semantic training features and the original audio training features of the second audio time-domain prediction data are input to the prediction generator to obtain third audio time-domain prediction data. In this way, the generator can effectively combine the semantic features with the timbre features and generate timbre-converted audio time-domain data whose timbre features are consistent with the given target timbre features and whose semantic features are consistent with the audio time-domain data input to the semantic encoder.
Fig. 6 shows a schematic diagram of timbre-converted audio generation. As shown in Fig. 6, the timbre feature of the target speaker is obtained through the trained WaveNet classification model, and this feature and the original audio time-domain data are input to the audio generation network for timbre conversion, generating timbre-converted audio time-domain data. Based on the trained model, a single audio file can be converted to audio files with multiple timbres, the network model does not need to be retrained for different conversion scenarios, and the model generalizes well.
It should be understood by those skilled in the art that the various implementations/embodiments of the present disclosure may be used in combination with the foregoing embodiments or may be used independently. Whether used alone or in combination with the previous embodiments, the principles of implementation are similar. In the practice of the present disclosure, some of the examples are described in terms of implementations that are used together. Of course, those skilled in the art will appreciate that such illustration is not limiting of the disclosed embodiments.
Based on the same conception, the embodiment of the disclosure also provides a device for generating audio.
It will be appreciated that, in order to implement the above-described functions, the apparatus for generating audio provided in the embodiments of the present disclosure includes corresponding hardware structures and/or software modules that perform the respective functions. The disclosed embodiments may be implemented in hardware or a combination of hardware and computer software, in combination with the various example elements and algorithm steps disclosed in the embodiments of the disclosure. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Those skilled in the art may implement the described functionality using different approaches for each particular application, but such implementation is not to be considered as beyond the scope of the embodiments of the present disclosure.
Fig. 7 is a block diagram illustrating an apparatus for generating audio according to an exemplary embodiment. Referring to fig. 7, the apparatus 100 includes an acquisition unit 101, an extraction unit 102, and a generation unit 103.
An acquisition unit 101 for acquiring original audio time domain data; the extracting unit 102 is configured to extract tone characteristics of the original audio time domain data to obtain original tone characteristics; the generating unit 103 is configured to generate target timbre audio time domain data based on the original audio time domain data, the original timbre feature and the target timbre feature, where the semantic feature in the target timbre audio time domain data matches the semantic feature of the original audio time domain data, and the timbre feature in the target timbre audio time domain data matches the target timbre feature.
In one embodiment, the generating unit 103 generates the target-timbre audio time-domain data based on the original audio time-domain data, the original timbre features and the target timbre features in the following manner: generating the target-timbre audio time-domain data based on the original audio time-domain data, the original timbre features, the target timbre features and a pre-trained audio generation network model; the audio generation network model is used for performing timbre conversion on audio time-domain data to generate timbre-converted audio time-domain data.
In one embodiment, the generating unit 103 generates the target-timbre audio time-domain data based on the original audio time-domain data, the original timbre features, the target timbre features and the pre-trained audio generation network model in the following manner: obtaining semantic features of the original audio time-domain data based on the original audio time-domain data and a semantic encoder included in the audio generation network model; and generating the target-timbre audio time-domain data based on the semantic features, the original timbre features, the target timbre features and a generator included in the audio generation network model; the generator is used for generating audio time-domain data based on semantic features and timbre features.
In one embodiment, the generating unit 103 obtains the semantic features of the original audio time-domain data based on the original audio time-domain data and the semantic encoder included in the audio generation network model in the following manner: inputting the original audio time-domain data to the semantic encoder included in the audio generation network model, and inputting the semantic features output by the semantic encoder to a timbre class classifier included in the audio generation network model, the timbre class classifier being used to identify the timbre class of the input semantic features; and constraining the semantic encoder based on the output of the timbre class classifier so that the semantic features output by the semantic encoder do not contain timbre features, thereby obtaining the semantic features of the original audio time-domain data.
In one embodiment, the generator is pre-trained in the following manner: inputting first audio time-domain data to the semantic encoder to obtain first audio semantic features, and inputting the first audio semantic features and a target timbre feature to a prediction generator to obtain target-timbre audio time-domain prediction data; inputting the target-timbre audio time-domain prediction data to the semantic encoder to obtain semantic features of the target-timbre audio time-domain prediction data, and inputting these semantic features and the timbre features of the first audio to the prediction generator to obtain second audio time-domain prediction data; determining a real/fake adversarial loss and a timbre feature regression loss by a preset discriminator based on the first audio time-domain data and the corresponding timbre features; determining a real/fake adversarial loss and a timbre feature regression loss by the discriminator based on the target-timbre audio time-domain prediction data and the target timbre features; determining a reconstruction loss based on the first audio time-domain data and the second audio time-domain prediction data; and constraining the training of the prediction generator based on the real/fake adversarial losses, the timbre feature regression losses and the reconstruction loss to obtain a generator satisfying the constraint conditions.
The specific manner in which the various modules perform the operations in the apparatus of the above embodiments have been described in detail in connection with the embodiments of the method, and will not be described in detail herein.
Fig. 8 is a block diagram illustrating an apparatus for generating audio according to an exemplary embodiment. For example, apparatus 200 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like.
Referring to fig. 8, the apparatus 200 may include one or more of the following components: a processing component 202, a memory 204, a power component 206, a multimedia component 208, an audio component 210, an input/output (I/O) interface 212, a sensor component 214, and a communication component 216.
The processing component 202 generally controls overall operation of the apparatus 200, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 202 may include one or more processors 220 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 202 can include one or more modules that facilitate interactions between the processing component 202 and other components. For example, the processing component 202 may include a multimedia module to facilitate interaction between the multimedia component 208 and the processing component 202.
The memory 204 is configured to store various types of data to support operations at the apparatus 200. Examples of such data include instructions for any application or method operating on the device 200, contact data, phonebook data, messages, pictures, videos, and the like. The memory 204 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power component 206 provides power to the various components of the device 200. The power components 206 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the device 200.
The multimedia component 208 includes a screen between the device 200 and the user that provides an output interface. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 208 includes a front-facing camera and/or a rear-facing camera. The front camera and/or the rear camera may receive external multimedia data when the apparatus 200 is in an operation mode, such as a photographing mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 210 is configured to output and/or input audio signals. For example, the audio component 210 includes a Microphone (MIC) configured to receive external audio signals when the device 200 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 204 or transmitted via the communication component 216. In some embodiments, audio component 210 further includes a speaker for outputting audio signals.
The I/O interface 212 provides an interface between the processing component 202 and peripheral interface modules, which may be keyboards, click wheels, buttons and the like. These buttons may include, but are not limited to: a home button, a volume button, a start button and a lock button.
The sensor assembly 214 includes one or more sensors for providing status assessments of various aspects of the apparatus 200. For example, the sensor assembly 214 may detect the on/off state of the device 200 and the relative positioning of components, such as the display and keypad of the device 200. The sensor assembly 214 may also detect a change in position of the device 200 or a component of the device 200, the presence or absence of user contact with the device 200, the orientation or acceleration/deceleration of the device 200, and a change in temperature of the device 200. The sensor assembly 214 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 214 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 214 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 216 is configured to facilitate wired or wireless communication between the apparatus 200 and other devices. The device 200 may access a wireless network based on a communication standard, such as WiFi, 4G, or 5G, or a combination thereof. In one exemplary embodiment, the communication component 216 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 216 further includes a near field communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 200 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements, for performing the methods described above.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, such as the memory 204 including instructions executable by the processor 220 of the apparatus 200 to perform the methods described above. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
It is understood that the term "plurality" in this disclosure means two or more, and other quantifying terms are to be understood similarly. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate that A exists alone, that A and B both exist, or that B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. The singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It is further understood that the terms "first," "second," and the like are used to describe various information, but such information should not be limited to these terms. These terms are only used to distinguish one type of information from another and do not denote a particular order or importance. Indeed, the expressions "first", "second", etc. may be used entirely interchangeably. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure.
It will be further understood that "connected" includes both direct connection where no other member is present and indirect connection where other element is present, unless specifically stated otherwise.
It will be further understood that although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following the general principles thereof and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the scope of the appended claims.

Claims (12)

  1. A method of generating audio, the method comprising:
    acquiring original audio time domain data;
    extracting tone characteristics of the original audio time domain data to obtain original tone characteristics;
    generating target timbre audio time domain data based on the original audio time domain data, the original timbre characteristics and target timbre characteristics, wherein semantic characteristics in the target timbre audio time domain data are matched with the semantic characteristics of the original audio time domain data, and timbre characteristics in the target timbre audio time domain data are matched with the target timbre characteristics.
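    By way of a non-limiting illustration only (not part of the claims), the flow recited in claim 1 could be sketched in PyTorch roughly as follows; the module names, layer choices, sampling rate, and embedding size are assumptions made for demonstration and are not recited in the claim.

```python
# Illustrative sketch only: TimbreEncoder and all dimensions below are assumed.
import torch
import torch.nn as nn


class TimbreEncoder(nn.Module):
    """Maps raw time-domain audio to a fixed-size timbre embedding."""

    def __init__(self, emb_dim: int = 128):
        super().__init__()
        self.frontend = nn.Conv1d(1, 64, kernel_size=400, stride=160)  # frame the waveform
        self.proj = nn.Linear(64, emb_dim)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) raw time-domain audio
        frames = torch.relu(self.frontend(wav.unsqueeze(1)))   # (batch, 64, n_frames)
        return self.proj(frames.mean(dim=-1))                  # (batch, emb_dim)


# Step 1: acquire original audio time-domain data (random audio as a stand-in).
original_wav = torch.randn(1, 16000)                 # 1 s at an assumed 16 kHz

# Step 2: extract timbre features of the original audio time-domain data.
timbre_encoder = TimbreEncoder()
original_timbre = timbre_encoder(original_wav)       # (1, 128)

# Step 3: a target timbre feature (e.g. extracted from reference audio of the
# target speaker) is combined with the original data and original timbre by a
# generator that outputs target-timbre audio time-domain data; a generator of
# that kind is sketched in the example following claim 3.
target_timbre = torch.randn(1, 128)
```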
  2. The method of claim 1, wherein the generating target timbre audio time domain data based on the original audio time domain data, the original timbre features, and target timbre features comprises:
    generating the target timbre audio time domain data based on the original audio time domain data, the original timbre features, the target timbre features, and a pre-trained audio generation network model;
    the audio generation network model is used for performing tone color conversion on the audio time domain data to generate tone color converted audio time domain data.
  3. The method of claim 2, wherein the generating target timbre audio time domain data based on the original audio time domain data, the original timbre features, and target timbre features, and a pre-trained audio generation network model comprises:
    obtaining semantic features of the original audio time domain data based on the original audio time domain data and a semantic encoder included in an audio generation network model;
    generating target timbre audio time domain data based on the semantic features, the original timbre features, the target timbre features, and a generator included in the audio generation network model;
    the generator is for generating audio time domain data based on the semantic features and the timbre features.
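    Again as a non-limiting illustration, the decomposition in claim 3 into a semantic encoder and a generator might look as sketched below in PyTorch; the architectures and dimensions are assumed, and the generator is conditioned on both the original and the target timbre embeddings as the claim recites.

```python
# Illustrative sketch only: architectures, kernel sizes and dimensions are assumed.
import torch
import torch.nn as nn


class SemanticEncoder(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.conv = nn.Conv1d(1, dim, kernel_size=400, stride=160)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # (batch, samples) -> (batch, dim, n_frames) semantic feature sequence
        return torch.relu(self.conv(wav.unsqueeze(1)))


class Generator(nn.Module):
    """Generates time-domain audio from semantic features plus the original
    and target timbre embeddings."""

    def __init__(self, sem_dim: int = 256, timbre_dim: int = 128):
        super().__init__()
        self.fuse = nn.Conv1d(sem_dim + 2 * timbre_dim, 256, kernel_size=1)
        self.upsample = nn.ConvTranspose1d(256, 1, kernel_size=400, stride=160)

    def forward(self, semantic, original_timbre, target_timbre):
        timbre = torch.cat([original_timbre, target_timbre], dim=1)       # (batch, 2*timbre_dim)
        # Broadcast the utterance-level timbre embeddings over every frame.
        timbre_seq = timbre.unsqueeze(-1).expand(-1, -1, semantic.size(-1))
        hidden = torch.relu(self.fuse(torch.cat([semantic, timbre_seq], dim=1)))
        return self.upsample(hidden).squeeze(1)                           # (batch, samples)


semantic_encoder = SemanticEncoder()
generator = Generator()

original_wav = torch.randn(1, 16000)
semantic = semantic_encoder(original_wav)                     # semantic features of the original audio
original_timbre = torch.randn(1, 128)                         # assumed original timbre embedding
target_timbre = torch.randn(1, 128)                           # assumed target timbre embedding
target_wav = generator(semantic, original_timbre, target_timbre)  # target-timbre time-domain audio
```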
  4. The method according to claim 3, wherein the obtaining semantic features of the original audio time domain data based on the original audio time domain data and a semantic encoder included in an audio generation network model comprises:
    inputting the original audio time domain data to a semantic encoder included in an audio generation network model, and inputting semantic features output by the semantic encoder to a tone class classifier included in the audio generation network model;
    the tone color class classifier is used for identifying tone color classes of input semantic features;
    and constraining the semantic encoder based on the output of the tone class classifier, so that the semantic features output by the semantic encoder do not contain any tone features, and obtaining the semantic features of the original audio time domain data.
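    One common way to realize the constraint in claim 4, offered here only as an assumed example, is to train the timbre class classifier on the semantic features through a gradient reversal layer, so that the classifier learns to identify the timbre class while the semantic encoder receives the opposite gradient and is pushed to remove timbre cues; the claim itself does not mandate this particular mechanism.

```python
# Illustrative sketch only: the gradient-reversal mechanism and the number of
# timbre classes are assumptions, not features recited in the claim.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output          # flip the gradient flowing back into the encoder


class TimbreClassifier(nn.Module):
    def __init__(self, sem_dim: int = 256, n_classes: int = 10):
        super().__init__()
        self.head = nn.Linear(sem_dim, n_classes)

    def forward(self, semantic: torch.Tensor) -> torch.Tensor:
        # semantic: (batch, sem_dim, n_frames); pool over time, then classify
        return self.head(GradReverse.apply(semantic).mean(dim=-1))


# Assumed usage with the SemanticEncoder sketched after claim 3:
# logits = TimbreClassifier()(semantic_encoder(wav))
# loss = nn.functional.cross_entropy(logits, timbre_class_ids)
# loss.backward()   # trains the classifier normally while pushing the encoder
#                   # to strip timbre information from its semantic features
```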
  5. A method according to claim 3 or 4, characterized in that the generator is pre-trained in the following way:
    inputting the first audio time domain data into the semantic encoder to obtain first audio semantic features, and inputting the first audio semantic features and target tone features into a prediction generator to obtain target tone audio time domain prediction data;
    inputting the target tone color audio time domain prediction data to the semantic encoder to obtain semantic features of the target tone color audio time domain prediction data, and inputting the semantic features and the tone color features of the first audio to a prediction generator to obtain second audio time domain prediction data;
    determining a true/false adversarial loss and a timbre feature regression loss by a preset discriminator based on the first audio time domain data and the corresponding timbre features;
    determining a true/false adversarial loss and a timbre feature regression loss by the discriminator based on the target timbre audio time domain prediction data and the target timbre features;
    determining a reconstruction loss based on the first audio time domain data and the second audio time domain prediction data;
    and constraining the training of the prediction generator based on the true/false adversarial losses, the timbre feature regression losses, and the reconstruction loss to obtain a generator that meets the constraint conditions.
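    As a final non-limiting sketch, the three kinds of losses recited in claim 5 could be computed as below; the discriminator design, the use of binary cross-entropy and mean-squared error, and how the losses are split between generator and discriminator updates are implementation assumptions not prescribed by the claim.

```python
# Illustrative sketch only: the discriminator architecture and loss functions are assumed.
import torch
import torch.nn as nn


class Discriminator(nn.Module):
    """Outputs a real/fake score and a regressed timbre embedding per clip."""

    def __init__(self, timbre_dim: int = 128):
        super().__init__()
        self.frontend = nn.Conv1d(1, 64, kernel_size=400, stride=160)
        self.real_fake = nn.Linear(64, 1)
        self.timbre_reg = nn.Linear(64, timbre_dim)

    def forward(self, wav: torch.Tensor):
        h = torch.relu(self.frontend(wav.unsqueeze(1))).mean(dim=-1)
        return self.real_fake(h), self.timbre_reg(h)


def training_step_losses(disc, first_wav, first_timbre, target_timbre,
                         pred_target_wav, second_pred_wav):
    """Losses named in claim 5; splitting them between generator and
    discriminator updates is an implementation choice assumed here."""
    bce = nn.functional.binary_cross_entropy_with_logits
    mse = nn.functional.mse_loss

    # Adversarial + timbre regression losses on the real (first) audio.
    real_score, real_timbre = disc(first_wav)
    loss_real = bce(real_score, torch.ones_like(real_score)) + mse(real_timbre, first_timbre)

    # Adversarial + timbre regression losses on the converted (predicted) audio:
    # it should look real to the discriminator and carry the target timbre.
    fake_score, fake_timbre = disc(pred_target_wav)
    loss_conv = bce(fake_score, torch.ones_like(fake_score)) + mse(fake_timbre, target_timbre)

    # Reconstruction loss: converting back with the first audio's timbre
    # should recover the first audio time-domain data.
    loss_rec = mse(second_pred_wav, first_wav)

    return loss_real, loss_conv, loss_rec
```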
  6. An apparatus for generating audio, comprising:
    the acquisition unit is used for acquiring the original audio time domain data;
    the extraction unit is used for extracting tone characteristics of the original audio time domain data to obtain original tone characteristics;
    the generating unit is used for generating target tone color audio time domain data based on the original audio time domain data, the original tone color characteristics and the target tone color characteristics, semantic characteristics in the target tone color audio time domain data are matched with the semantic characteristics of the original audio time domain data, and tone color characteristics in the target tone color audio time domain data are matched with the target tone color characteristics.
  7. The apparatus of claim 6, wherein the generating unit generates target timbre audio time-domain data based on the original audio time-domain data, the original timbre feature, and a target timbre feature in the following manner:
    generating the target timbre audio time domain data based on the original audio time domain data, the original timbre features, the target timbre features, and a pre-trained audio generation network model;
    the audio generation network model is used for performing tone color conversion on the audio time domain data to generate tone color converted audio time domain data.
  8. The apparatus of claim 7, wherein the generating unit generates the target timbre audio time domain data based on the original audio time domain data, the original timbre features, and target timbre features, and a pre-trained audio generation network model by:
    obtaining semantic features of the original audio time domain data based on the original audio time domain data and a semantic encoder included in an audio generation network model;
    generating target timbre audio time domain data based on the semantic features, the original timbre features, the target timbre features, and a generator included in the audio generation network model;
    the generator is for generating audio time domain data based on the semantic features and the timbre features.
  9. The apparatus according to claim 8, wherein the generating unit obtains semantic features of the original audio time domain data based on the original audio time domain data and a semantic encoder included in an audio generation network model in such a manner that:
    inputting the original audio time domain data to a semantic encoder included in an audio generation network model, and inputting semantic features output by the semantic encoder to a tone class classifier included in the audio generation network model;
    the tone color class classifier is used for identifying tone color classes of input semantic features;
    and constraining the semantic encoder based on the output of the tone class classifier, so that the semantic features output by the semantic encoder do not contain any tone features, and obtaining the semantic features of the original audio time domain data.
  10. The apparatus of claim 8 or 9, wherein the generator is pre-trained in the following manner:
    inputting the first audio time domain data into the semantic encoder to obtain first audio semantic features, and inputting the first audio semantic features and target tone features into a prediction generator to obtain target tone audio time domain prediction data;
    inputting the target tone color audio time domain prediction data to the semantic encoder to obtain semantic features of the target tone color audio time domain prediction data, and inputting the semantic features and the tone color features of the first audio to a prediction generator to obtain second audio time domain prediction data;
    determining a true/false adversarial loss and a timbre feature regression loss by a preset discriminator based on the first audio time domain data and the corresponding timbre features;
    determining a true/false adversarial loss and a timbre feature regression loss by the discriminator based on the target timbre audio time domain prediction data and the target timbre features;
    determining a reconstruction loss based on the first audio time domain data and the second audio time domain prediction data;
    and constraining the training of the prediction generator based on the true/false adversarial losses, the timbre feature regression losses, and the reconstruction loss to obtain a generator that meets the constraint conditions.
  11. An apparatus for generating audio, comprising:
    a processor;
    a memory for storing processor-executable instructions;
    wherein the processor is configured to perform the method of any one of claims 1 to 5.
  12. A storage medium having instructions stored therein which, when executed by a processor, enable the processor to perform the method of any one of claims 1 to 5.
CN202280004612.0A 2022-06-07 2022-06-07 Method, device and storage medium for generating audio Pending CN117546238A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/097437 WO2023236054A1 (en) 2022-06-07 2022-06-07 Audio generation method and apparatus, and storage medium

Publications (1)

Publication Number Publication Date
CN117546238A true CN117546238A (en) 2024-02-09

Family

ID=89117356

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280004612.0A Pending CN117546238A (en) 2022-06-07 2022-06-07 Method, device and storage medium for generating audio

Country Status (2)

Country Link
CN (1) CN117546238A (en)
WO (1) WO2023236054A1 (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111771213B (en) * 2018-02-16 2021-10-08 杜比实验室特许公司 Speech style migration
CN112164407A (en) * 2020-09-22 2021-01-01 腾讯音乐娱乐科技(深圳)有限公司 Tone conversion method and device
CN112382271B (en) * 2020-11-30 2024-03-26 北京百度网讯科技有限公司 Voice processing method, device, electronic equipment and storage medium
CN114333865A (en) * 2021-12-22 2022-04-12 广州市百果园网络科技有限公司 Model training and tone conversion method, device, equipment and medium

Also Published As

Publication number Publication date
WO2023236054A1 (en) 2023-12-14

Similar Documents

Publication Publication Date Title
US11854563B2 (en) System and method for creating timbres
CN108806656B (en) Automatic generation of songs
CN111583944A (en) Sound changing method and device
CN111508511A (en) Real-time sound changing method and device
US20210335364A1 (en) Computer program, server, terminal, and speech signal processing method
CN110265011B (en) Electronic equipment interaction method and electronic equipment
WO2019242414A1 (en) Voice processing method and apparatus, storage medium, and electronic device
CN111798821B (en) Sound conversion method, device, readable storage medium and electronic equipment
CN112837669B (en) Speech synthesis method, device and server
CN112309365A (en) Training method and device of speech synthesis model, storage medium and electronic equipment
JP2023527473A (en) AUDIO PLAYING METHOD, APPARATUS, COMPUTER-READABLE STORAGE MEDIUM AND ELECTRONIC DEVICE
CN110781329A (en) Image searching method and device, terminal equipment and storage medium
WO2022057759A1 (en) Voice conversion method and related device
CN117546238A (en) Method, device and storage medium for generating audio
CN111696566B (en) Voice processing method, device and medium
CN111950266A (en) Data processing method and device and data processing device
Midtlyng et al. Voice adaptation by color-encoded frame matching as a multi-objective optimization problem for future games
CN112863476A (en) Method and device for constructing personalized speech synthesis model, method and device for speech synthesis and testing
CN111696564B (en) Voice processing method, device and medium
JP7230085B2 (en) Method and device, electronic device, storage medium and computer program for processing sound
CN111696565B (en) Voice processing method, device and medium
CN114464151B (en) Sound repairing method and device
Xue et al. Application of Speech Recognition based Interaction in Augmented Reality Children’s Books
CN117935770A (en) Synthetic voice adjusting method, training method and related device
CN117496963A (en) Music generation method, training method and device of music generation model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination