WO2023236054A1 - Method, apparatus and storage medium for generating audio - Google Patents

Method, apparatus and storage medium for generating audio

Info

Publication number
WO2023236054A1
WO2023236054A1 (application PCT/CN2022/097437)
Authority
WO
WIPO (PCT)
Prior art keywords
timbre
time domain
domain data
audio
audio time
Prior art date
Application number
PCT/CN2022/097437
Other languages
English (en)
French (fr)
Inventor
张�浩
王凯
尹旭东
史润宇
Original Assignee
北京小米移动软件有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京小米移动软件有限公司 filed Critical 北京小米移动软件有限公司
Priority to PCT/CN2022/097437 priority Critical patent/WO2023236054A1/zh
Priority to CN202280004612.0A priority patent/CN117546238A/zh
Publication of WO2023236054A1 publication Critical patent/WO2023236054A1/zh

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003 - Changing voice quality, e.g. pitch or formants

Definitions

  • the present disclosure relates to the field of audio technology, and in particular, to a method, device and storage medium for generating audio.
  • Sound conversion technology has a wide range of application scenarios, among which audio timbre conversion technology is a type of sound conversion.
  • audio timbre conversion uses computers to extract, from an audio representation (time series, spectrum, etc.), semantic information that is unrelated to timbre together with specific timbre features, and then combines the semantic information with different timbre features to achieve audio timbre conversion.
  • in the related art, the audio time-series data is converted into spectrum data, the audio timbre is changed by changing the style of the spectrogram image, and the timbre-converted spectrum data is finally converted back into time-series data. With this approach, the audio time domain data after timbre transformation is not realistic enough.
  • the present disclosure provides a method, device and storage medium for generating audio.
  • according to a first aspect of the embodiments of the present disclosure, a method of generating audio is provided, including: obtaining original audio time domain data; extracting timbre features of the original audio time domain data to obtain original timbre features; and generating target timbre audio time domain data based on the original audio time domain data, the original timbre features and target timbre features, where semantic features in the target timbre audio time domain data match semantic features of the original audio time domain data, and timbre features in the target timbre audio time domain data match the target timbre features.
  • in an implementation, generating target timbre audio time domain data based on the original audio time domain data, the original timbre features and the target timbre features includes: generating target timbre audio time domain data based on the original audio time domain data, the original timbre features and the target timbre features, as well as a pre-trained audio generation network model; the audio generation network model is used to perform timbre conversion on audio time domain data to generate timbre-converted audio time domain data.
  • in an implementation, generating target timbre audio time domain data based on the original audio time domain data, the original timbre features and target timbre features, and a pre-trained audio generation network model includes: obtaining semantic features of the original audio time domain data based on the original audio time domain data and a semantic encoder included in the audio generation network model; and generating target timbre audio time domain data based on the semantic features, the original timbre features, the target timbre features and a generator included in the audio generation network model; the generator is used to generate audio time domain data based on semantic features and timbre features.
  • in an implementation, obtaining the semantic features of the original audio time domain data based on the original audio time domain data and the semantic encoder included in the audio generation network model includes: inputting the original audio time domain data into the semantic encoder included in the audio generation network model, and inputting the semantic features output by the semantic encoder into a timbre category classifier included in the audio generation network model, where the timbre category classifier is used to identify the timbre category of the input semantic features; and constraining the semantic encoder based on the output of the timbre category classifier so that the semantic features output by the semantic encoder do not contain any timbre features, thereby obtaining the semantic features of the original audio time domain data.
  • in an implementation, the generator is trained in the following manner: inputting first audio time domain data into the semantic encoder to obtain first audio semantic features, and inputting the first audio semantic features and the target timbre features into a prediction generator to obtain target timbre audio time domain prediction data; inputting the target timbre audio time domain prediction data into the semantic encoder to obtain semantic features of the target timbre audio time domain prediction data, and inputting these semantic features and the timbre features of the first audio into the prediction generator to obtain second audio time domain prediction data; determining, based on the first audio time domain data and the corresponding timbre features and through a preset discriminator, a true/false adversarial loss and a timbre feature regression loss; determining, based on the target timbre audio time domain prediction data and the target timbre features and through the discriminator, a true/false adversarial loss and a timbre feature regression loss; determining a reconstruction loss based on the first audio time domain data and the second audio time domain prediction data; and constraining the training of the prediction generator based on the true/false adversarial loss, the timbre feature regression loss and the reconstruction loss, to obtain a generator that satisfies the constraint conditions.
  • according to a second aspect of the embodiments of the present disclosure, an apparatus for generating audio is provided, including:
  • an acquisition unit, used to obtain original audio time domain data; an extraction unit, used to extract timbre features of the original audio time domain data to obtain original timbre features; and a generation unit, used to generate target timbre audio time domain data based on the original audio time domain data, the original timbre features and target timbre features, where the semantic features in the target timbre audio time domain data match the semantic features of the original audio time domain data, and the timbre features in the target timbre audio time domain data match the target timbre features.
  • in an implementation, the generation unit generates target timbre audio time domain data based on the original audio time domain data, the original timbre features and the target timbre features in the following manner: generating target timbre audio time domain data based on the original audio time domain data, the original timbre features and the target timbre features, as well as a pre-trained audio generation network model; the audio generation network model is used to perform timbre conversion on audio time domain data to generate timbre-converted audio time domain data.
  • in an implementation, the generation unit generates target timbre audio time domain data based on the original audio time domain data, the original timbre features and the target timbre features, and a pre-trained audio generation network model, in the following manner: obtaining the semantic features of the original audio time domain data based on the original audio time domain data and the semantic encoder included in the audio generation network model; and generating target timbre audio time domain data based on the semantic features, the original timbre features, the target timbre features and the generator included in the audio generation network model; the generator is used to generate audio time domain data of the corresponding timbre based on semantic features and timbre features.
  • in an implementation, the generation unit obtains the semantic features of the original audio time domain data based on the original audio time domain data and the semantic encoder included in the audio generation network model in the following manner: inputting the original audio time domain data into the semantic encoder included in the audio generation network model, and inputting the semantic features output by the semantic encoder into the timbre category classifier included in the audio generation network model, where the timbre category classifier is used to identify the categories of timbre features; and constraining the semantic encoder based on the output of the timbre category classifier so that the semantic features output by the semantic encoder do not include timbre features, thereby obtaining the semantic features of the original audio time domain data.
  • in an implementation, the generator is pre-trained in the following manner: inputting the first audio time domain data into the semantic encoder to obtain the first audio semantic features, and inputting the first audio semantic features and the target timbre features into the prediction generator to obtain the target timbre audio time domain prediction data; inputting the target timbre audio time domain prediction data into the semantic encoder to obtain the semantic features of the target timbre audio time domain prediction data, and inputting these semantic features and the timbre features of the first audio into the prediction generator to obtain the second audio time domain prediction data; determining, based on the first audio time domain data and the corresponding timbre features and through a preset discriminator, the true/false adversarial loss and the timbre feature regression loss; determining, based on the target timbre audio time domain prediction data and the target timbre features and through the discriminator, the true/false adversarial loss and the timbre feature regression loss; determining the reconstruction loss based on the first audio time domain data and the second audio time domain prediction data; and constraining the training of the prediction generator based on the true/false adversarial loss, the timbre feature regression loss and the reconstruction loss, to obtain a generator that satisfies the constraints.
  • according to a third aspect of the embodiments of the present disclosure, a device for generating audio is provided, including: a processor; and a memory used to store instructions executable by the processor; where the processor is configured to execute the method described in the first aspect or any implementation of the first aspect.
  • according to a fourth aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, in which instructions are stored; when the instructions in the storage medium are executed by a processor of a terminal, the terminal is enabled to execute the method described in the first aspect or any implementation of the first aspect.
  • the technical solution provided by the embodiments of the present disclosure may include the following beneficial effects: on the basis of obtaining the original audio time domain data, extract the timbre features of the audio time domain data to obtain the original timbre features. Further, based on the original audio time domain data, original timbre features and target timbre features, target timbre audio time domain data is generated. Based on this, during the audio conversion process, the loss of audio information converted from the time domain to the frequency domain is avoided, making the audio time domain data after timbre conversion more realistic.
  • FIG. 1 is a flowchart of a method of generating audio according to an exemplary embodiment.
  • Figure 2 is a flowchart illustrating a method of generating target timbre audio time domain data according to an exemplary embodiment.
  • Figure 3 is a flowchart illustrating a method of generating target timbre audio time domain data according to an exemplary embodiment.
  • Figure 4 is a flowchart illustrating a method of obtaining semantic features of original audio time domain data according to an exemplary embodiment.
  • Figure 5 is a flowchart illustrating pre-training of a generator according to an exemplary embodiment.
  • Figure 6 shows a schematic diagram of timbre conversion audio generation.
  • Figure 7 is a block diagram of a device for generating audio according to an exemplary embodiment.
  • FIG. 8 is a block diagram of a device for generating audio according to an exemplary embodiment.
  • the method for generating audio provided by the embodiments of the present disclosure can be applied to application fields such as audio synthesis, human-computer interaction, and virtual reality, and can especially involve scenarios such as music creation, game voice changing, audio reading, and online live broadcasts.
  • for example, a musician may come up with ideas while composing and later forget them; with the technology of the present disclosure, a melody hummed by the musician can be played with different instruments, so as to determine the best way to perform the piece and adjust the melody of the score, thereby creating expressive musical compositions.
  • game socialization is an important trend in the development of the game industry in recent years.
  • Adding the option of changing the voice of players in the game can make the voice interaction in the game more entertaining, and improve user stickiness by improving the social attributes of the game.
  • by combining sound conversion technology, users can choose to have the stories in books told in the voice of their loved ones, and children can choose to have their favorite stories told in the voice of their favorite anime characters.
  • online streamers can use timbre conversion to preserve their characteristic language style while choosing different timbres for different business scenarios: the voice can be turned into an entertaining, funny voice or into a voice with a target timbre, which makes online live broadcasts more fun.
  • Audio timbre conversion technology uses computers to extract, from an audio representation (time series, spectrum, etc.), semantic information that is unrelated to timbre together with specific timbre features, and then combines the semantic information with different timbre features to achieve audio timbre conversion.
  • the CQT spectrum of the input audio is calculated through CQT conversion, and then the CQT spectrum is converted into the CQT spectrum of the audio of the target domain timbre through the Cycle-GAN network, thereby realizing the conversion of the audio CQT spectrum.
  • This technology converts the timbre-transformed CQT spectrum into time-domain audio through the pre-trained WaveNet network model, thereby generating the target style audio after timbre transformation.
  • this audio timbre conversion method has the following two problems. First, the time domain data generated from the timbre-transformed spectrum is not realistic enough. Audio time domain data spans a large number of samples (one second of audio can contain as many as 11052 sampling points), so directly converting the audio data into a spectrum to perform timbre conversion easily loses part of the audio information, which causes large semantic differences between the timbre-converted audio and the input audio and can even introduce a large amount of noise. When audio is converted from time-series data into spectrum form, the spectral envelopes of different timbres do not follow the same peak pattern at different pitches, and different overtones and harmonics need to be handled, so extracting timbre features and semantic features from the spectrogram image is quite difficult. Second, a trained model cannot convert audio files into multiple timbre styles: because Cycle-GAN is used to transform the CQT spectrum, a trained model can only convert the CQT spectrum of one timbre into that of another; converting the input spectrum into N timbres requires training N different Cycle-GAN models, which greatly increases the workload.
  • in view of this, the present disclosure provides a method for generating audio: on the basis of the obtained original audio time domain data, the timbre features of the audio time domain data are extracted to obtain the original timbre features; then, based on the original audio time domain data, the original timbre features and the target timbre features, target timbre audio time domain data is generated. On this basis, the loss of information caused by converting audio from the time domain into the frequency domain is avoided during audio conversion, which makes the timbre-transformed audio time domain data more realistic. Therefore, compared with the audio timbre conversion methods in the related art, the method of generating audio provided by the present disclosure is more flexible and realistic.
  • Figure 1 is a flow chart of a method of generating audio according to an exemplary embodiment. As shown in Figure 1, the method of generating audio is used in a terminal and includes the following steps.
  • step S11 original audio time domain data is obtained.
  • step S12 the timbre features of the original audio time domain data are extracted to obtain the original timbre features.
  • timbre refers to the distinctive character that different sounds always exhibit in their waveforms; different objects vibrate with different characteristics.
  • in the embodiments of the present disclosure, Mel-Frequency Cepstral Coefficients (MFCC) may be used to extract the timbre features of the original audio time domain data to obtain the original timbre feature t_1. It can be understood that the embodiments of the present disclosure do not specifically limit how the timbre features are extracted from the audio.
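  • As an illustration only, a minimal timbre-feature extraction along these lines could use the librosa library; the library choice, sampling rate and coefficient count below are assumptions made here for the sketch, not values taken from the filing.

```python
import librosa
import numpy as np

def extract_timbre_feature(wav_path: str, sr: int = 16000, n_mfcc: int = 20) -> np.ndarray:
    """Extract a fixed-length timbre descriptor from an audio file.

    The filing only states that MFCCs may be used to obtain the original
    timbre feature t_1; time-averaging the per-frame coefficients into a
    single vector is an illustrative choice, not something it prescribes.
    """
    audio, _ = librosa.load(wav_path, sr=sr)                     # raw time-domain samples
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)   # shape (n_mfcc, frames)
    return mfcc.mean(axis=1)                                     # time-averaged timbre vector

# t_1 = extract_timbre_feature("speaker_a.wav")   # hypothetical file path
```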
  • step S13 target timbre audio time domain data is generated based on the original audio time domain data, original timbre features and target timbre features.
  • the semantic features in the target timbre audio time domain data match the semantic features of the original audio time domain data
  • the timbre features in the target timbre audio time domain data match the target timbre feature t_n.
  • in the present disclosure, original audio time domain data is obtained; the timbre features of the original audio time domain data are extracted to obtain the original timbre features; and target timbre audio time domain data is generated based on the original audio time domain data, the original timbre features and the target timbre features. In this way, the converted audio both retains the spoken content of the original audio time domain data and accurately converts the original timbre features into the set target timbre features, avoiding the loss of information during timbre conversion.
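  • Taken together, steps S11 to S13 compose into a small conversion pipeline. The sketch below is only an illustration of that data flow: the callable semantic_encoder and generator objects and the dummy stand-ins are assumptions, not components specified by the filing.

```python
import numpy as np

def generate_target_timbre_audio(x_a_t1: np.ndarray,
                                 t_1: np.ndarray,
                                 t_n: np.ndarray,
                                 semantic_encoder,
                                 generator) -> np.ndarray:
    """S11-S13 in one call: original audio x_a_t1, its timbre t_1 and a target
    timbre t_n go in, target-timbre time-domain audio comes out.  At inference
    time t_1 is not strictly needed by the generator described later (it mainly
    constrains training), but it is kept here to mirror the claimed inputs."""
    s_a = semantic_encoder(x_a_t1)      # timbre-independent semantic feature S_a
    x_a_tn = generator(s_a, t_n)        # combine S_a with the target timbre t_n
    return x_a_tn

# Example with trivial stand-in models, just to show the shapes and data flow:
dummy_encoder = lambda x: np.zeros(128)
dummy_generator = lambda s, t: np.zeros(96000)
out = generate_target_timbre_audio(np.zeros(96000), np.zeros(128), np.zeros(128),
                                   dummy_encoder, dummy_generator)
```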
  • Figure 2 is a flowchart of generating target timbre audio time domain data according to an exemplary embodiment. As shown in Figure 2, generating target timbre audio time domain data based on the original audio time domain data, the original timbre features and the target timbre features includes the following steps.
  • step S21 the original audio features of the original audio time domain data are obtained, and the target timbre features are determined.
  • the original audio feature t_1 of the original audio time domain data is obtained, and the target timbre feature t_n is determined.
  • the process of determining the target timbre characteristics is essentially the process of building a target timbre feature data set.
  • the audio time domain data of some speakers are selected from the data set containing different speakers, and then the timbre is classified according to the identity information of the different speakers to obtain audio time domain data sets with different timbres.
  • in the related art, timbre features and semantic features are obtained from audio time domain data, and the timbre-transformed audio time domain data is then generated by a generator. However, timbre features are difficult to represent: although some researchers classify spectrograms by speaker and use the features before the classification layer as timbre features, this approach is too coarse, there is a gap between the information content of the spectrum and that of the time domain data, and the extracted timbre features are not comprehensive or accurate enough. In the embodiments of the present disclosure, a WaveNet network model is trained to classify the speaker audio files of the audio time domain data set. When the WaveNet network model can accurately predict the speaker identity corresponding to the input audio, the features the model extracts from the audio file contain information unique to that speaker; since the differences between speakers are mainly differences in timbre, the features obtained at this point are features strongly correlated with timbre, that is, timbre features.
  • step S22 target timbre audio time domain data is generated based on the original audio time domain data, original timbre features and target timbre features, and the pre-trained audio generation network model.
  • the audio generation network model is used to perform timbre conversion on audio time domain data to generate audio time domain data after timbre conversion.
  • the audio generation network model includes a speech encoder, a timbre category classifier, a generator and a discriminator.
  • This network can generate, from a given target timbre feature t_n and the semantic feature S_a extracted from the audio time domain data X_a_t1, audio time domain data X_a_tn whose timbre is consistent with the audio timbre corresponding to the target timbre feature t_n and whose content is consistent with X_a_t1.
  • X_a_t1 and X_a_tn are audio time-series data of length 96000, and the semantic feature S_a and the timbre feature t_n are both feature vectors of length 128, with X_a_tn = G(X_a_t1 | t_n).
  • t_1 and t_n respectively denote the timbre feature of the input audio and the timbre feature of the audio expected to be generated.
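  • To make the stated shapes concrete, here is a hedged PyTorch stub that only respects the interface (96000-sample waveforms, 128-dimensional S_a and t_n, X_a_tn = G(X_a_t1 | t_n)); its internals are deliberately trivial and are not the WaveNet-based generator the embodiment actually uses.

```python
import torch
import torch.nn as nn

AUDIO_LEN, FEAT_DIM = 96000, 128   # lengths stated in the embodiment

class TinyGenerator(nn.Module):
    """Stand-in for the timbre-conditioned generator G( . | timbre)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * FEAT_DIM, 1024), nn.ReLU(),
            nn.Linear(1024, AUDIO_LEN), nn.Tanh())

    def forward(self, semantic: torch.Tensor, timbre: torch.Tensor) -> torch.Tensor:
        # concatenate the semantic feature S_a with the conditioning timbre vector
        return self.net(torch.cat([semantic, timbre], dim=-1))

s_a = torch.randn(1, FEAT_DIM)       # semantic feature of X_a_t1
t_n = torch.randn(1, FEAT_DIM)       # target timbre feature
x_a_tn = TinyGenerator()(s_a, t_n)   # X_a_tn = G(X_a_t1 | t_n), shape (1, 96000)
```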
  • in the present disclosure, the original audio features of the original audio time domain data are obtained, the target timbre features are determined, and target timbre audio time domain data is generated based on the original audio time domain data, the original timbre features and the target timbre features, and the pre-trained audio generation network model.
  • through the present disclosure, based on the pre-trained audio generation network model, accurate target timbre audio time domain data can be obtained according to the target timbre features preset by the user, and audio conversion into multiple timbres can be achieved without retraining the network model multiple times for different timbres. For example, in a specific application scenario, according to a "loli" voice selected by the user, the speaker's normal audio is converted into loli-voiced audio for playback, which adds fun.
  • Figure 3 is a flowchart of generating target timbre audio time domain data according to an exemplary embodiment. As shown in Figure 3, generating target timbre audio time domain data based on the original audio time domain data, the original timbre features and the target timbre features, and the pre-trained audio generation network model includes the following steps.
  • step S31 the semantic features of the original audio time domain data are obtained based on the original audio time domain data and the semantic encoder included in the audio generation network model.
  • the semantic encoder adopts the WaveNet network model.
  • the WaveNet network model is a sequence generation model that can be used for speech generation modeling; in acoustic modeling for speech synthesis, WaveNet directly learns the mapping of sample-value sequences and therefore gives good synthesis results.
  • the semantic encoder is trained to extract the semantic feature S_a from audio time domain data X_a_t1 whose timbre feature is t_1 and whose semantic feature is S_a, i.e. S_a = E(X_a_t1); that is, the pre-trained semantic encoder yields features that do not contain the original timbre features of the original audio time domain data, but only the semantic features of the original audio time domain data.
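  • The filing only says the semantic encoder is a WaveNet network model trained so that S_a = E(X_a_t1). The dilated-convolution stack below is a rough sketch of that idea; the layer count, channel width and pooling are arbitrary choices made for illustration, not the architecture of the patent.

```python
import torch
import torch.nn as nn

class SemanticEncoder(nn.Module):
    """WaveNet-flavoured encoder sketch: dilated 1-D convolutions over the raw
    waveform, pooled into a 128-dimensional semantic vector S_a."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        layers, channels = [], 1
        for i in range(6):                                # exponentially growing dilation
            layers += [nn.Conv1d(channels, 64, kernel_size=3,
                                 dilation=2 ** i, padding=2 ** i),
                       nn.ReLU()]
            channels = 64
        self.conv = nn.Sequential(*layers)
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        h = self.conv(wav.unsqueeze(1))                   # (B, 64, T)
        return self.proj(h.mean(dim=-1))                  # global pooling -> S_a, (B, 128)

s_a = SemanticEncoder()(torch.randn(2, 96000))            # -> shape (2, 128)
```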
  • step S32 target timbre audio time domain data is generated based on semantic features, original timbre features, target timbre features, and the generator included in the audio generation network model.
  • the generator is used to generate audio time domain data based on semantic features and timbre features.
  • the generator also uses the WaveNet network model.
  • the generator implements the mapping from the original audio time domain data to the predicted (reconstructed) original audio time domain data, and then implements the mapping X_a_t1 to X_a_tn and the mapping from the target timbre audio time domain data back to the predicted original audio time domain data.
  • in the present disclosure, the semantic features of the original audio time domain data are obtained based on the original audio time domain data and the semantic encoder included in the audio generation network model; target timbre audio time domain data is then generated based on the semantic features, the original timbre features, the target timbre features and the generator included in the audio generation model.
  • in this way, clean semantic features of the original audio time domain data are obtained, and three mappings are realized: from the original audio time domain data to the predicted original audio time domain data, from the original audio time domain data to the target timbre audio time domain data (X_a_t1 to X_a_tn), and from the target timbre audio time domain data back to the predicted original audio time domain data.
  • Figure 4 is a flowchart illustrating a method of obtaining semantic features of original audio time domain data according to an exemplary embodiment. As shown in Figure 4, obtaining the semantic features of the original audio time domain data based on the original audio time domain data and the semantic encoder included in the audio generation network model includes the following steps.
  • step S41 the original audio time domain data is input to the semantic encoder included in the audio generation network model, and the semantic features output by the semantic encoder are input to the timbre category classifier included in the audio generation network model.
  • the timbre category classifier is used to identify the categories of timbre features, for example guitar, piano or violin sounds. Specifically, the timbre category classifier judges the identity information of the person corresponding to the audio file according to the timbre feature t_1 contained in the semantic feature S_a; this is used to constrain the semantic encoder so that the extracted semantic feature S_a contains as little of the timbre feature t_1 as possible.
  • in step S42, the semantic encoder is constrained based on the output of the timbre category classifier, so that the category corresponding to any timbre features contained in the semantic features output by the semantic encoder differs from the category corresponding to the original timbre features, and the semantic features of the original audio time domain data are obtained.
  • adversarial training is performed between the semantic encoder and the timbre category classifier.
  • first, the timbre category classifier is trained: the original audio time domain data X_a_t1 and the category I_c of the audio file are input into the untrained timbre category classifier, and the timbre category classifier is optimized by minimizing the loss value calculated with the domain adversarial loss function L_cls until the timbre category classifier converges, giving the trained timbre category classifier.
  • then the semantic encoder is trained: the network weight parameters of the timbre category classifier are fixed, the original audio time domain data X_a_t1 and the category I_c of the audio file are input into the untrained semantic encoder to obtain semantic features that still contain timbre features, and these semantic features are input into the trained timbre category classifier; according to the loss value calculated with the domain adversarial loss function L_cls, the semantic encoder is optimized by maximizing the loss value until the semantic encoder converges, giving the trained semantic encoder.
  • in the process of training the semantic encoder and the timbre category classifier, the loss value is calculated using the domain adversarial loss function L_cls. This loss ensures that the semantic encoder extracts, from the audio time-series data, semantic information features that are independent of the timbre information features.
  • for the timbre category classifier, the aim is to accurately determine, from the semantic feature S_a, the identity of the person to whom the audio belongs, which would force the semantic encoder to extract timbre information from the audio time domain data; training the timbre category classifier therefore aims to minimize the loss function L_cls.
  • the semantic encoder, in contrast, is intended to extract a semantic feature S_a that contains no timbre information features, so training the semantic encoder aims to maximize the loss function L_cls.
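  • The alternating min/max scheme described above could be written as follows; the cross-entropy instantiation of L_cls, the module names and the optimizer wiring are assumptions made for the sketch, since the filing gives the loss only as an image formula.

```python
import torch
import torch.nn as nn

def adversarial_cls_step(encoder, classifier, opt_e, opt_c, wav, speaker_id):
    """One round of the encoder / timbre-category-classifier adversarial game.

    The classifier is trained to minimise L_cls (recognise the speaker I_c from
    S_a); the encoder is then trained to maximise the same loss so that S_a
    carries no timbre information.  Cross-entropy is an assumed stand-in for
    the filing's L_cls.
    """
    ce = nn.CrossEntropyLoss()

    # 1) update the timbre category classifier: minimise L_cls
    with torch.no_grad():
        s_a = encoder(wav)                        # encoder features kept fixed here
    loss_c = ce(classifier(s_a), speaker_id)
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()

    # 2) update the semantic encoder: maximise L_cls (classifier weights frozen,
    #    only the encoder optimizer steps)
    loss_e = -ce(classifier(encoder(wav)), speaker_id)
    opt_e.zero_grad(); loss_e.backward(); opt_e.step()
    return loss_c.item(), -loss_e.item()
```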
  • original audio time domain data is input to a semantic encoder included in the audio generation network model, and semantic features output by the semantic encoder are input to a timbre category classifier included in the audio generation network model.
  • the semantic encoder is constrained based on the output of the timbre category classifier, so that the category corresponding to any timbre features contained in the semantic features output by the semantic encoder differs from the category corresponding to the original timbre features, and the semantic features of the original audio time domain data are obtained.
  • the semantic encoder is finally able to extract corresponding semantic features from the input audio.
  • FIG. 5 is a flowchart illustrating pre-training of a generator according to an exemplary embodiment. As shown in FIG. 5, the generator is pre-trained in the following manner, including the following steps.
  • in step S51, the first audio time domain data is input into the semantic encoder to obtain the first audio semantic features, and the first audio semantic features and the target timbre features are input into the prediction generator to obtain the target timbre audio time domain prediction data.
  • in step S52, the target timbre audio time domain prediction data is input into the semantic encoder to obtain the semantic features of the target timbre audio time domain prediction data, and these semantic features together with the timbre features of the first audio are input into the prediction generator to obtain the second audio time domain prediction data.
  • in the present disclosure, based on the first audio time domain data and the corresponding timbre features, the true/false adversarial loss and the timbre feature regression loss are determined through a preset discriminator. Based on the target timbre audio time domain prediction data and the target timbre features, the true/false adversarial loss and the timbre feature regression loss are determined through the discriminator. The reconstruction loss is determined based on the first audio time domain data and the second audio time domain prediction data. Based on the true/false adversarial loss, the timbre feature regression loss and the reconstruction loss, the training of the prediction generator is constrained to obtain a generator that satisfies the constraint conditions.
  • in the embodiments of the present disclosure, adversarial training is performed between the generator and the discriminator. The discriminator is trained first.
  • each training round of the discriminator contains two groups of inputs: the first group is the original audio time domain data X_a_t1 and the original timbre feature t_1; the second group is the timbre-transformed target timbre audio time domain data X_a_tn, generated by the generator from the input original audio time domain data X_a_t1 and the target timbre feature t_n, together with the target timbre feature t_n. The network parameters of the discriminator are optimized according to the true/false adversarial loss and the timbre feature regression loss until the discriminator converges, giving the trained discriminator; for each group of inputs, the discriminator outputs a predicted timbre vector and a true/false probability value.
  • the generator is then trained. Each training round of the generator contains three groups of inputs: the first group is the original audio time domain data X_a_t1 and the target timbre feature t_n, which outputs the target timbre audio time domain data X_a_tn; X_a_tn is input into the discriminator to predict a true/false probability value and a timbre feature. The second group is the original audio time domain data X_a_t1 and the original timbre feature t_1, which outputs reconstructed original audio time domain data used to calculate the second term of the reconstruction loss L_rec. The third group is the target timbre audio time domain data X_a_tn and the original timbre feature t_1, which outputs reconstructed original audio time domain data used to calculate the first term of the reconstruction loss L_rec. The loss value is then calculated from the loss functions, and the network parameters of the generator are optimized and updated by back-propagation.
  • in the embodiments of the present disclosure, in the process of training the discriminator and the generator, a timbre feature regression loss function, a true/false adversarial loss function and a reconstruction loss function are used as the loss functions.
  • the timbre feature regression loss is intended to make the timbre attributes of the audio generated by the generator consistent with the given timbre features. The first term of the loss L_t uses the discriminator to predict the timbre feature of the input original audio time domain data X_a_t1 and computes an L2 regression loss against the original timbre feature t_1 corresponding to that data, thereby improving the discriminator's ability to predict timbre features.
  • the second term uses the discriminator, which by then has a certain timbre prediction capability, to predict the timbre feature corresponding to the audio time domain data G(X_a_t1 | t_n) generated by the generator and computes an L2 regression loss against the given expected target timbre feature t_n, forcing the timbre features of the audio time domain data generated by the generator to be as consistent as possible with the given target timbre features.
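  • Writing D_t( . ) for the timbre vector predicted by the discriminator (a notation introduced here for convenience, not one used in the filing) and assuming the squared form of the L2 regression loss, the two terms described above can plausibly be written as:

```latex
L_t = \mathbb{E}\left[\lVert D_t(X_{a\_t1}) - t_1 \rVert_2^2\right]
    + \mathbb{E}\left[\lVert D_t\!\left(G(X_{a\_t1}\mid t_n)\right) - t_n \rVert_2^2\right]
```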
  • the true/false adversarial loss function is used to make the audio time domain data generated by the generator match the distribution of real audio time domain data as closely as possible.
  • the loss function is expressed as L_I(G, D, X_a_t1, t_n) = E[log(D(X_a_t1))] + E[log(1 - D(G(X_a_t1 | t_n)))]. For input (real) audio time domain data, the discriminator is expected to output a true/false prediction value close to 1.
  • for the audio time domain data generated by the generator, the discriminator is expected to output a true/false prediction value close to 0.
  • at the beginning of training, since the generator's generation ability is still weak, the discriminator will output a probability prediction close to 0 for the audio time domain data generated by the generator, whereas for audio time domain data generated by a well-trained generator the discriminator is expected to output a true/false probability value close to 1.
  • the generator, for its part, wants the generated audio time domain data to be as similar as possible to the real audio time domain data, i.e. it wants the discriminator to output a probability prediction close to 1 for the audio it generates.
  • the final goal of the generator is therefore to minimize the true/false adversarial loss function, while the goal of the discriminator is to maximize it.
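  • With the discriminator's true/false head written as D( . ) in (0, 1), the quoted objective can be computed as below. The function name and the epsilon guard are illustrative assumptions; only the log terms themselves come from the stated formula.

```python
import torch

def adversarial_losses(d_real: torch.Tensor, d_fake: torch.Tensor):
    """d_real = D(X_a_t1), d_fake = D(G(X_a_t1 | t_n)), both probabilities.

    The discriminator maximises L_I = E[log D(real)] + E[log(1 - D(fake))],
    i.e. minimises its negation; the generator minimises E[log(1 - D(fake))],
    pushing D(fake) towards 1, matching the roles described in the text.
    """
    eps = 1e-7
    loss_d = -(torch.log(d_real + eps).mean()
               + torch.log(1.0 - d_fake + eps).mean())
    loss_g = torch.log(1.0 - d_fake + eps).mean()
    return loss_d, loss_g
```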
  • in the embodiments of the present disclosure, a reconstruction loss function is also used. This loss is intended to make the audio time domain data generated by the generator consistent with the input audio time domain data in terms of semantic information, changing only the timbre characteristics.
  • in the loss function, G(G(X_a_t1 | t_n) | t_1) means that the generator first generates the target timbre audio time domain data X_a_tn from the input original audio time domain data X_a_t1 and the target timbre feature t_n; the semantic features of X_a_tn are then extracted by the semantic encoder, and the generator reconstructs the input audio time domain data from these semantic features and the original timbre feature t_1.
  • the generator also reconstructs the input audio time domain data G(X_a_t1 | t_1) based on the semantic feature S_a of the input original audio time domain data X_a_t1 and its corresponding timbre feature t_1.
  • if the generator can keep the semantic information unchanged while changing only the timbre attributes, the reconstructed audio time-series data should be very similar to the input audio time-series data. Therefore, the difference between the reconstructed audio time domain data and the input audio time domain data is calculated through an L2 regression loss, which forces the generator to keep the semantic features unchanged when generating audio time domain data.
  • λ_1 is a hyperparameter used to weight the importance of the reconstruction loss at different stages.
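  • One way to write the reconstruction loss consistently with the two reconstruction paths described above is sketched here; the placement of λ_1 on the second term and the unsquared L2 norm are assumptions, since the source gives the formula only as an embedded image:

```latex
L_{rec} = \mathbb{E}\left[\lVert G\big(G(X_{a\_t1}\mid t_n)\mid t_1\big) - X_{a\_t1} \rVert_2\right]
        + \lambda_1\,\mathbb{E}\left[\lVert G(X_{a\_t1}\mid t_1) - X_{a\_t1} \rVert_2\right]
```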
  • the total loss function of the audio generation method of timbre transformation includes domain adversarial loss, timbre feature regression loss, reconstruction loss and true/false adversarial loss. There are differences in the weights of different loss functions.
  • the final loss function is expressed as L = λ_cls·L_cls + λ_t·L_t + λ_rec·L_rec + λ_I·L_I, where λ_cls, λ_t, λ_rec and λ_I are hyperparameters that control the relative importance of each loss.
  • G* denotes the timbre-transformation audio generation network: the network weight parameters of the generator are optimized with the goal of minimizing the loss value of the loss function L, while the network weight parameters of the discriminator and the timbre category classifier are optimized with the goal of maximizing the loss value of the loss function L.
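  • A compact, hedged reading of the overall objective follows; the weighted sum and the min/max split are as stated above, while the default weights and the alternating-update comment are illustrative assumptions (the filing does not disclose the values used).

```python
def total_loss(l_cls, l_t, l_rec, l_i, lam=(1.0, 1.0, 1.0, 1.0)):
    """L = λ_cls·L_cls + λ_t·L_t + λ_rec·L_rec + λ_I·L_I.

    The λ weights are hyperparameters controlling the relative importance of
    each loss; the values used in the filing are not disclosed, so unit
    weights are used here purely as placeholders.
    """
    lam_cls, lam_t, lam_rec, lam_i = lam
    return lam_cls * l_cls + lam_t * l_t + lam_rec * l_rec + lam_i * l_i

# Alternating optimisation as described for G*:
#   - the generator's parameters are stepped to *minimise* L
#   - the discriminator's and timbre category classifier's parameters are
#     stepped to *maximise* L (equivalently, to minimise -L)
```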
  • the original semantic training features and the original audio training features of the audio time domain training data are input to the prediction generator to obtain the first audio time domain prediction data.
  • the original semantic training features and the target audio training features of the audio time domain training data are input to the prediction generator to obtain the second audio time domain prediction data.
  • the predicted semantic training features of the second audio time domain prediction data and the original audio training features are input to the prediction generator to obtain the third audio time domain prediction data.
  • through the embodiments of the present disclosure, the generator can effectively combine semantic features with timbre features and generate timbre-transformed audio time domain data whose timbre features are consistent with the given target timbre features and whose semantic features are consistent with those of the audio time domain data input to the semantic encoder.
  • Figure 6 shows a schematic diagram of timbre conversion audio generation.
  • as shown in Figure 6, the timbre features of the target speaker are obtained through the trained WaveNet classification model, and these features together with the original audio time domain data are input into the timbre-conversion audio generation network to generate timbre-transformed audio time domain data.
  • based on the trained model, a single audio file can be converted into audio files of multiple timbres; there is no need to retrain the network model for different conversion scenarios, and the model generalizes well.
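  • The inference path of Figure 6 (obtain a target timbre vector, then feed it together with the source audio through the conversion network) might be exercised as below. Every helper name, file path and timbre label here is hypothetical and only illustrates the claim that one trained model serves many target timbres.

```python
import numpy as np

def batch_convert(source_wav: np.ndarray,
                  target_timbres: dict,
                  semantic_encoder,
                  generator) -> dict:
    """Convert one source recording into several target timbres with a single
    trained model, as the text claims (no per-timbre retraining needed)."""
    s_a = semantic_encoder(source_wav)                 # timbre-free semantics S_a
    return {name: generator(s_a, t_n)                  # X_a_tn = G(X_a_t1 | t_n)
            for name, t_n in target_timbres.items()}

# converted = batch_convert(load_wav("speaker.wav"),                 # hypothetical loader
#                           {"guitar": t_guitar, "piano": t_piano},  # hypothetical timbre vectors
#                           semantic_encoder, generator)
```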
  • embodiments of the present disclosure also provide a device for generating audio.
  • the device for generating audio provided by the embodiments of the present disclosure includes hardware structures and/or software modules corresponding to each function.
  • the embodiments of the present disclosure can be implemented in the form of hardware or a combination of hardware and computer software. Whether a function is performed by hardware or computer software driving the hardware depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered to go beyond the scope of the technical solutions of the embodiments of the present disclosure.
  • Figure 7 is a block diagram of a device for generating audio according to an exemplary embodiment.
  • the device 100 includes an acquisition unit 101 , an extraction unit 102 and a generation unit 103 .
  • the acquisition unit 101 is used to obtain the original audio time domain data; the extraction unit 102 is used to extract the timbre features of the original audio time domain data to obtain the original timbre features; and the generation unit 103 is used to generate target timbre audio time domain data based on the original audio time domain data, the original timbre features and the target timbre features.
  • the semantic features in the target timbre audio time domain data match the semantic features of the original audio time domain data.
  • the timbre features in the target timbre audio time domain data match the target timbre features.
  • the generation unit 103 generates target timbre audio time domain data based on the original audio time domain data, original timbre features and target timbre features in the following manner: based on the original audio time domain data, original timbre features and target timbre features, And a pre-trained audio generation network model to generate target timbre audio time domain data; the audio generation network model is used to perform timbre conversion on audio time domain data to generate audio time domain data after timbre conversion.
  • the generation unit 103 generates target timbre audio time domain data based on the original audio time domain data, the original timbre features and the target timbre features, and a pre-trained audio generation network model, in the following manner: the semantic features of the original audio time domain data are obtained based on the original audio time domain data and the semantic encoder included in the audio generation network model; target timbre audio time domain data is generated based on the semantic features, the original timbre features, the target timbre features and the generator included in the audio generation model; the generator is used to generate audio time domain data based on semantic features and timbre features.
  • the generation unit 103 obtains the semantic features of the original audio time domain data based on the original audio time domain data and the semantic encoder included in the audio generation network model in the following manner: the original audio time domain data is input into the semantic encoder included in the audio generation network model, and the semantic features output by the semantic encoder are input into the timbre category classifier included in the audio generation network model, where the timbre category classifier is used to identify the timbre category of the input semantic features; the semantic encoder is constrained based on the output of the timbre category classifier so that the semantic features output by the semantic encoder do not contain any timbre features, and the semantic features of the original audio time domain data are obtained.
  • the generator is pre-trained in the following manner: the first audio time domain data is input into the semantic encoder to obtain the first audio semantic features, and the first audio semantic features and the target timbre features are input into the prediction generator to obtain the target timbre audio time domain prediction data; the target timbre audio time domain prediction data is input into the semantic encoder to obtain its semantic features, and these semantic features together with the timbre features of the first audio are input into the prediction generator to obtain the second audio time domain prediction data; based on the first audio time domain data and the corresponding timbre features, the true/false adversarial loss and the timbre feature regression loss are determined through a preset discriminator; based on the target timbre audio time domain prediction data and the target timbre features, the true/false adversarial loss and the timbre feature regression loss are determined through the discriminator; the reconstruction loss is determined based on the first audio time domain data and the second audio time domain prediction data; and based on the true/false adversarial loss, the timbre feature regression loss and the reconstruction loss, the training of the prediction generator is constrained to obtain a generator that satisfies the constraint conditions.
  • FIG. 8 is a block diagram of a device for generating audio according to an exemplary embodiment.
  • the device 200 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like.
  • device 200 may include one or more of the following components: processing component 202, memory 204, power component 206, multimedia component 208, audio component 210, input/output (I/O) interface 212, sensor component 214, and Communication component 216.
  • Processing component 202 generally controls the overall operations of device 200, such as operations associated with display, phone calls, data communications, camera operations, and recording operations.
  • the processing component 202 may include one or more processors 220 to execute instructions to complete all or part of the steps of the above method.
  • processing component 202 may include one or more modules that facilitate interaction between processing component 202 and other components.
  • processing component 202 may include a multimedia module to facilitate interaction between multimedia component 208 and processing component 202.
  • Memory 204 is configured to store various types of data to support operations at device 200 . Examples of such data include instructions for any application or method operating on device 200, contact data, phonebook data, messages, pictures, videos, etc.
  • Memory 204 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk.
  • Power component 206 provides power to various components of device 200 .
  • Power components 206 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to device 200 .
  • Multimedia component 208 includes a screen that provides an output interface between the device 200 and the user.
  • the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user.
  • the touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide action.
  • multimedia component 208 includes a front-facing camera and/or a rear-facing camera.
  • the front camera and/or the rear camera may receive external multimedia data.
  • Each front-facing camera and rear-facing camera can be a fixed optical lens system or have a focal length and optical zoom capabilities.
  • Audio component 210 is configured to output and/or input audio signals.
  • audio component 210 includes a microphone (MIC) configured to receive external audio signals when device 200 is in operating modes, such as call mode, recording mode, and voice recognition mode. The received audio signals may be further stored in memory 204 or sent via communications component 216 .
  • audio component 210 also includes a speaker for outputting audio signals.
  • the I/O interface 212 provides an interface between the processing component 202 and a peripheral interface module, which may be a keyboard, a click wheel, a button, etc. These buttons may include, but are not limited to: Home button, Volume buttons, Start button, and Lock button.
  • Sensor component 214 includes one or more sensors for providing various aspects of status assessment for device 200 .
  • for example, the sensor component 214 can detect the open/closed state of the device 200 and the relative positioning of components, such as the display and keypad of the device 200; the sensor component 214 can also detect a change in position of the device 200 or of a component of the device 200, the presence or absence of user contact with the device 200, the orientation or acceleration/deceleration of the device 200, and temperature changes of the device 200.
  • Sensor assembly 214 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact.
  • Sensor assembly 214 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications.
  • the sensor component 214 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
  • Communication component 216 is configured to facilitate wired or wireless communication between apparatus 200 and other devices.
  • Device 200 may access a wireless network based on a communication standard, such as WiFi, 4G or 5G, or a combination thereof.
  • the communication component 216 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communications component 216 also includes a near field communications (NFC) module to facilitate short-range communications.
  • the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
  • apparatus 200 may be implemented by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic components, for executing the above method.
  • a non-transitory computer-readable storage medium including instructions such as a memory 204 including instructions, which can be executed by the processor 220 of the device 200 to complete the above method is also provided.
  • the non-transitory computer-readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
  • first, second, etc. are used to describe various information, but such information should not be limited to these terms. These terms are only used to distinguish information of the same type from each other and do not imply a specific order or importance. In fact, expressions such as “first” and “second” can be used interchangeably.
  • first information may also be called second information, and similarly, the second information may also be called first information.
  • connection includes a direct connection without other components between the two, and also includes an indirect connection with other elements between the two.

Landscapes

  • Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

The present disclosure relates to a method, apparatus and storage medium for generating audio. The method of generating audio includes: obtaining original audio time domain data; extracting timbre features of the original audio time domain data to obtain original timbre features; and generating target timbre audio time domain data based on the original audio time domain data, the original timbre features and target timbre features, where semantic features in the target timbre audio time domain data match semantic features of the original audio time domain data, and timbre features in the target timbre audio time domain data match the target timbre features. Through the present disclosure, the loss of information caused by converting audio from the time domain into the frequency domain is avoided during audio conversion, making the timbre-transformed audio time domain data more realistic.

Description

Method, apparatus and storage medium for generating audio
Technical Field
The present disclosure relates to the field of audio technology, and in particular to a method, apparatus and storage medium for generating audio.
Background
Sound conversion technology has a wide range of application scenarios; audio timbre conversion is one kind of sound conversion.
Audio timbre conversion uses computers to extract, from an audio representation (time series, spectrum, etc.), semantic information that is unrelated to timbre together with specific timbre features, and then combines the semantic information with different timbre features to achieve audio timbre conversion.
In the related art, audio time-series data is converted into spectrum data, the audio timbre is changed by changing the style of the spectrogram image, and the timbre-converted spectrum data is finally converted back into time-series data. However, with this approach the timbre-transformed audio time domain data is not realistic enough.
Summary
To overcome the problems in the related art, the present disclosure provides a method, apparatus and storage medium for generating audio.
According to a first aspect of the embodiments of the present disclosure, a method of generating audio is provided, including:
obtaining original audio time domain data; extracting timbre features of the original audio time domain data to obtain original timbre features; and generating target timbre audio time domain data based on the original audio time domain data, the original timbre features and target timbre features, where semantic features in the target timbre audio time domain data match semantic features of the original audio time domain data, and timbre features in the target timbre audio time domain data match the target timbre features.
In one implementation, generating target timbre audio time domain data based on the original audio time domain data, the original timbre features and the target timbre features includes: generating target timbre audio time domain data based on the original audio time domain data, the original timbre features and the target timbre features, as well as a pre-trained audio generation network model; the audio generation network model is used to perform timbre conversion on audio time domain data to generate timbre-converted audio time domain data.
In one implementation, generating target timbre audio time domain data based on the original audio time domain data, the original timbre features and the target timbre features, and the pre-trained audio generation network model includes: obtaining semantic features of the original audio time domain data based on the original audio time domain data and a semantic encoder included in the audio generation network model; and generating target timbre audio time domain data based on the semantic features, the original timbre features, the target timbre features and a generator included in the audio generation network model; the generator is used to generate audio time domain data based on semantic features and timbre features.
In one implementation, obtaining the semantic features of the original audio time domain data based on the original audio time domain data and the semantic encoder included in the audio generation network model includes: inputting the original audio time domain data into the semantic encoder included in the audio generation network model, and inputting the semantic features output by the semantic encoder into a timbre category classifier included in the audio generation network model, where the timbre category classifier is used to identify the timbre category of the input semantic features; and constraining the semantic encoder based on the output of the timbre category classifier so that the semantic features output by the semantic encoder contain no timbre features, thereby obtaining the semantic features of the original audio time domain data.
In one implementation, the generator is trained as follows: inputting first audio time domain data into the semantic encoder to obtain first audio semantic features, and inputting the first audio semantic features and the target timbre features into a prediction generator to obtain target timbre audio time domain prediction data; inputting the target timbre audio time domain prediction data into the semantic encoder to obtain semantic features of the target timbre audio time domain prediction data, and inputting these semantic features and the timbre features of the first audio into the prediction generator to obtain second audio time domain prediction data; determining, based on the first audio time domain data and the corresponding timbre features and through a preset discriminator, a true/false adversarial loss and a timbre feature regression loss; determining, based on the target timbre audio time domain prediction data and the target timbre features and through the discriminator, a true/false adversarial loss and a timbre feature regression loss; determining a reconstruction loss based on the first audio time domain data and the second audio time domain prediction data; and constraining the training of the prediction generator based on the true/false adversarial loss, the timbre feature regression loss and the reconstruction loss, to obtain a generator that satisfies the constraint conditions.
According to a second aspect of the embodiments of the present disclosure, an apparatus for generating audio is provided, including:
an acquisition unit, used to obtain original audio time domain data; an extraction unit, used to extract timbre features of the original audio time domain data to obtain original timbre features; and a generation unit, used to generate target timbre audio time domain data based on the original audio time domain data, the original timbre features and target timbre features, where semantic features in the target timbre audio time domain data match semantic features of the original audio time domain data, and timbre features in the target timbre audio time domain data match the target timbre features.
In one implementation, the generation unit generates target timbre audio time domain data based on the original audio time domain data, the original timbre features and the target timbre features in the following manner: generating target timbre audio time domain data based on the original audio time domain data, the original timbre features and the target timbre features, as well as a pre-trained audio generation network model; the audio generation network model is used to perform timbre conversion on audio time domain data to generate timbre-converted audio time domain data.
In one implementation, the generation unit generates target timbre audio time domain data based on the original audio time domain data, the original timbre features and the target timbre features, and the pre-trained audio generation network model, in the following manner: obtaining the semantic features of the original audio time domain data based on the original audio time domain data and the semantic encoder included in the audio generation network model; and generating target timbre audio time domain data based on the semantic features, the original timbre features, the target timbre features and the generator included in the audio generation network model; the generator is used to generate audio time domain data of the corresponding timbre based on semantic features and timbre features.
In one implementation, the generation unit obtains the semantic features of the original audio time domain data based on the original audio time domain data and the semantic encoder included in the audio generation network model in the following manner: inputting the original audio time domain data into the semantic encoder included in the audio generation network model, and inputting the semantic features output by the semantic encoder into the timbre category classifier included in the audio generation network model, where the timbre category classifier is used to identify the categories of timbre features; and constraining the semantic encoder based on the output of the timbre category classifier so that the semantic features output by the semantic encoder do not include timbre features, thereby obtaining the semantic features of the original audio time domain data.
In one implementation, the generator is pre-trained as follows: inputting the first audio time domain data into the semantic encoder to obtain the first audio semantic features, and inputting the first audio semantic features and the target timbre features into the prediction generator to obtain the target timbre audio time domain prediction data; inputting the target timbre audio time domain prediction data into the semantic encoder to obtain the semantic features of the target timbre audio time domain prediction data, and inputting these semantic features and the timbre features of the first audio into the prediction generator to obtain the second audio time domain prediction data; determining, based on the first audio time domain data and the corresponding timbre features and through a preset discriminator, the true/false adversarial loss and the timbre feature regression loss; determining, based on the target timbre audio time domain prediction data and the target timbre features and through the discriminator, the true/false adversarial loss and the timbre feature regression loss; determining the reconstruction loss based on the first audio time domain data and the second audio time domain prediction data; and constraining the training of the prediction generator based on the true/false adversarial loss, the timbre feature regression loss and the reconstruction loss, to obtain a generator that satisfies the constraint conditions.
According to a third aspect of the embodiments of the present disclosure, a device for generating audio is provided, including:
a processor; and
a memory for storing instructions executable by the processor;
where the processor is configured to execute the method described in the first aspect or any implementation of the first aspect.
According to a fourth aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, in which instructions are stored; when the instructions in the storage medium are executed by a processor of a terminal, the terminal is enabled to execute the method described in the first aspect or any implementation of the first aspect.
The technical solutions provided by the embodiments of the present disclosure may have the following beneficial effects: on the basis of the obtained original audio time domain data, the timbre features of the audio time domain data are extracted to obtain the original timbre features; further, target timbre audio time domain data is generated based on the original audio time domain data, the original timbre features and the target timbre features. On this basis, the loss of information caused by converting audio from the time domain into the frequency domain is avoided during audio conversion, making the timbre-transformed audio time domain data more realistic.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present disclosure.
Brief Description of the Drawings
The accompanying drawings are incorporated into and constitute a part of this specification; they illustrate embodiments consistent with the present disclosure and, together with the specification, serve to explain the principles of the present disclosure.
Figure 1 is a flowchart of a method of generating audio according to an exemplary embodiment.
Figure 2 is a flowchart of generating target timbre audio time domain data according to an exemplary embodiment.
Figure 3 is a flowchart of generating target timbre audio time domain data according to an exemplary embodiment.
Figure 4 is a flowchart of obtaining semantic features of original audio time domain data according to an exemplary embodiment.
Figure 5 is a flowchart of pre-training a generator according to an exemplary embodiment.
Figure 6 is a schematic diagram of timbre-conversion audio generation.
Figure 7 is a block diagram of an apparatus for generating audio according to an exemplary embodiment.
Figure 8 is a block diagram of a device for generating audio according to an exemplary embodiment.
Detailed Description
Exemplary embodiments will be described in detail here, examples of which are shown in the accompanying drawings. When the following description refers to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure.
In the drawings, identical or similar reference numbers denote identical or similar elements, or elements with identical or similar functions, throughout. The described embodiments are only some, not all, of the embodiments of the present disclosure. The embodiments described below with reference to the drawings are exemplary and intended to explain the present disclosure, and should not be construed as limiting it. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present disclosure without creative effort fall within the protection scope of the present disclosure. The embodiments of the present disclosure are described in detail below with reference to the drawings.
The method for generating audio provided by the embodiments of the present disclosure can be applied to application fields such as audio synthesis, human-computer interaction and virtual reality, and in particular may involve scenarios such as music creation, in-game voice changing, audio reading and online live streaming. When people use more practical ways to create content or music, the final value of the product will far exceed the artistic creation process. For example, a musician may come up with ideas while composing and later forget them; with the technology of the present disclosure, a melody hummed by the musician can be played with different instruments, so as to determine the best way to perform the music and adjust the melody of the score, thereby creating expressive musical compositions. Second, game socialization has been an important trend in the game industry in recent years; adding a voice-changing option for players in a game can make in-game voice interaction more entertaining and improve user stickiness by enhancing the social attributes of the game. Furthermore, by combining sound conversion technology, users can choose to have the stories in books told in the voice of their loved ones, and children can choose to have their favorite stories told in the voice of their favorite anime characters. In addition, online streamers can use timbre conversion to preserve their characteristic language style while choosing voices of different timbres for different business scenarios: the voice can be turned into an entertaining, funny voice or into a voice with a target timbre, which makes online live streaming more fun.
Sound conversion technology has a wide range of application scenarios, and audio timbre conversion is one kind of sound conversion. Audio timbre conversion uses computers to extract, from an audio representation (time series, spectrum, etc.), semantic information that is unrelated to timbre together with specific timbre features, and then combines the semantic information with different timbre features to achieve audio timbre conversion.
In the related art, the CQT spectrum of the input audio is calculated through a CQT transform, and the CQT spectrum is then converted by a Cycle-GAN network into the CQT spectrum of audio with the target-domain timbre, thereby realizing conversion of the audio CQT spectrum. This technique converts the timbre-transformed CQT spectrum into time-domain audio through a pre-trained WaveNet network model, thereby generating target-style audio after timbre transformation. In this case, this audio timbre conversion method has the following two problems. First, the time domain data generated from the timbre-transformed spectrum is not realistic enough. This is because audio time domain data spans a large number of samples (one second of audio can contain as many as 11052 sampling points); directly converting the audio data into a spectrum to realize timbre conversion easily loses part of the audio information, so that the timbre-converted audio has large semantic differences from the input audio and may even carry a large amount of noise. When audio is converted from time-series data into spectrum form, the spectral envelopes of different timbres do not follow the same peak pattern at different pitches, and different overtones and harmonics need to be handled, so extracting timbre features and semantic features from the spectrogram image is quite difficult. Second, a trained model cannot convert audio files into multiple timbre styles. Because Cycle-GAN is used to transform the CQT spectrum timbre, a trained model can only convert the CQT spectrum of one timbre into that of another timbre; to convert the input spectrum into N timbres, N different Cycle-GAN models need to be trained, which greatly increases the workload.
In view of this, the present disclosure provides a method for generating audio: on the basis of the obtained original audio time domain data, the timbre features of the audio time domain data are extracted to obtain the original timbre features; further, target timbre audio time domain data is generated based on the original audio time domain data, the original timbre features and the target timbre features. On this basis, the loss of information caused by converting audio from the time domain into the frequency domain is avoided during audio conversion, making the timbre-transformed audio time domain data more realistic. Therefore, compared with the audio timbre conversion methods in the related art, the method of generating audio provided by the present disclosure is more flexible and realistic.
Figure 1 is a flowchart of a method of generating audio according to an exemplary embodiment. As shown in Figure 1, the method of generating audio is used in a terminal and includes the following steps.
In step S11, original audio time domain data is obtained.
In step S12, the timbre features of the original audio time domain data are extracted to obtain the original timbre features.
Here, timbre refers to the distinctive character that different sounds always exhibit in their waveforms; different objects vibrate with different characteristics.
In the embodiments of the present disclosure, Mel-Frequency Cepstral Coefficients (MFCC) may be used to extract the timbre features of the original audio time domain data to obtain the original timbre feature t_1. It can be understood that the embodiments of the present disclosure do not specifically limit how the timbre features are extracted from the audio.
In step S13, target timbre audio time domain data is generated based on the original audio time domain data, the original timbre features and the target timbre features.
Here, the semantic features in the target timbre audio time domain data match the semantic features of the original audio time domain data, and the timbre features in the target timbre audio time domain data match the target timbre feature t_n.
In the present disclosure, original audio time domain data is obtained; the timbre features of the original audio time domain data are extracted to obtain the original timbre features; and target timbre audio time domain data is generated based on the original audio time domain data, the original timbre features and the target timbre features. Through the present disclosure, it is ensured that the timbre-converted audio both retains the spoken content of the user in the original audio time domain data and accurately converts the original timbre features into the set target timbre features, avoiding the loss of information during timbre conversion.
In the following disclosed embodiments, the process of generating target timbre audio time domain data is described in detail.
Figure 2 is a flowchart of generating target timbre audio time domain data according to an exemplary embodiment. As shown in Figure 2, generating target timbre audio time domain data based on the original audio time domain data, the original timbre features and the target timbre features includes the following steps.
In step S21, the original audio features of the original audio time domain data are obtained, and the target timbre features are determined.
In the embodiments of the present disclosure, the original audio feature t_1 of the original audio time domain data is obtained, and the target timbre feature t_n is determined. Determining the target timbre features is essentially the process of building a target timbre feature data set. In this process, the audio time domain data of some speakers is selected from a data set containing different speakers, and the data is then classified by timbre according to the identity information of the different speakers to obtain audio time domain data sets with different timbres. In the related art, timbre features and semantic features are obtained from audio time domain data, and the timbre-transformed audio time domain data is then generated by a generator. However, representing timbre features is difficult: although some researchers classify spectrograms by speaker and use the features before the classification layer as timbre features, this approach is too coarse, there is a gap between the information content of the spectrum and that of the time domain data, and the extracted timbre features are not comprehensive or accurate enough. In the embodiments of the present disclosure, a WaveNet network model is trained to classify the speaker audio files of the audio time domain data set; when the WaveNet network model can accurately predict the speaker identity corresponding to the input audio, the features the model extracts from the audio file contain information unique to that speaker, and since the differences between speakers are mainly differences in timbre, the features obtained at this point are features strongly correlated with timbre, that is, timbre features.
In step S22, target timbre audio time domain data is generated based on the original audio time domain data, the original timbre features and the target timbre features, and the pre-trained audio generation network model.
Here, the audio generation network model is used to perform timbre conversion on audio time domain data to generate timbre-converted audio time domain data.
In the embodiments of the present disclosure, the audio generation network model includes a speech encoder, a timbre category classifier, a generator and a discriminator. The network can generate, from a given target timbre feature t_n and the semantic feature S_a extracted from the audio time domain data X_a_t1, audio time domain data X_a_tn whose timbre is consistent with the audio timbre corresponding to the target timbre feature t_n and whose content is consistent with X_a_t1, where X_a_t1 and X_a_tn are audio time-series data of length 96000, the semantic feature S_a and the timbre feature t_n are both feature vectors of length 128, and X_a_tn = G(X_a_t1 | t_n); t_1 and t_n respectively denote the timbre feature of the input audio and the timbre feature of the audio expected to be generated.
In the present disclosure, the original audio features of the original audio time domain data are obtained, the target timbre features are determined, and target timbre audio time domain data is generated based on the original audio time domain data, the original timbre features and the target timbre features, and the pre-trained audio generation network model. Through the present disclosure, based on the pre-trained audio generation network model, accurate target timbre audio time domain data can be obtained according to the target timbre features preset by the user, and audio conversion into multiple timbres can be achieved without retraining the network model multiple times for different timbres. For example, in a specific application scenario, according to a "loli" voice selected by the user, the speaker's normal audio is converted into loli-voiced audio for playback, which adds fun.
FIG. 3 is a flowchart of generating target timbre audio time domain data according to an exemplary embodiment. As shown in FIG. 3, generating target timbre audio time domain data based on the original audio time domain data, the original timbre features, the target timbre features, and the pre-trained audio generation network model includes the following steps.
In step S31, semantic features of the original audio time domain data are obtained based on the original audio time domain data and the semantic encoder included in the audio generation network model.
In an embodiment of the present disclosure, the semantic encoder adopts a WaveNet network model. The WaveNet network model is a sequence generation model that can be used for speech generation modeling. In acoustic modeling for speech synthesis, WaveNet can directly learn a mapping over sequences of sample values and therefore achieves good synthesis quality. The semantic encoder is trained to extract the semantic feature S_a from audio time domain data X_a_t1 whose timbre feature is t_1 and whose semantic feature is S_a, i.e., S_a = E(X_a_t1). That is, the pre-trained semantic encoder obtains semantic features that contain only the semantic information of the original audio time domain data and do not contain its original timbre features.
In step S32, target timbre audio time domain data is generated based on the semantic features, the original timbre features, the target timbre features, and the generator included in the audio generation network model.
Here, the generator is used to generate audio time domain data based on semantic features and timbre features. The generator also adopts a WaveNet network model.
In an embodiment of the present disclosure, the generator implements the mapping X_a_t1 → X̂_a_t1 (reconstructing the input audio from its semantic feature and original timbre feature), and then implements the mapping X_a_t1 → X_a_tn and the mapping X_a_tn → X̂_a_t1.
In the present disclosure, semantic features of the original audio time domain data are obtained based on the original audio time domain data and the semantic encoder included in the audio generation network model. Target timbre audio time domain data is generated based on the semantic features, the original timbre features, the target timbre features, and the generator included in the audio generation model. Through the present disclosure, clean semantic features of the original audio time domain data are obtained, and three mappings are realized: the mapping from the original audio time domain data to the predicted original audio time domain data, X_a_t1 → X̂_a_t1; the mapping from the original audio time domain data to the target timbre audio time domain data, X_a_t1 → X_a_tn; and the mapping from the target timbre audio time domain data to the predicted original audio time domain data, X_a_tn → X̂_a_t1.
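In code form, the three mappings are compositions of the semantic encoder E and the generator G. The sketch below assumes `encoder` and `generator` follow the interfaces sketched earlier in this description.

```python
def timbre_mappings(encoder, generator, x_a_t1, t_1, t_n):
    """Sketch of the three mappings (assumed interfaces: encoder(wav) -> 128-dim
    semantic feature; generator(semantic, timbre) -> waveform)."""
    s_a     = encoder(x_a_t1)                    # S_a = E(X_a_t1)
    x_rec   = generator(s_a, t_1)                # X_a_t1 -> predicted X_a_t1
    x_a_tn  = generator(s_a, t_n)                # X_a_t1 -> X_a_tn
    x_cycle = generator(encoder(x_a_tn), t_1)    # X_a_tn -> predicted X_a_t1
    return x_rec, x_a_tn, x_cycle
```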
FIG. 4 is a flowchart of obtaining semantic features of original audio time domain data according to an exemplary embodiment. As shown in FIG. 4, obtaining the semantic features of the original audio time domain data based on the original audio time domain data and the semantic encoder included in the audio generation network model includes the following steps.
In step S41, the original audio time domain data is input to the semantic encoder included in the audio generation network model, and the semantic features output by the semantic encoder are input to the timbre class classifier included in the audio generation network model.
Here, the timbre class classifier is used to identify the class of timbre features, for example, guitar, piano, violin, and so on. The timbre class classifier judges the identity information of the person corresponding to the audio file according to the timbre feature t_1 contained in the semantic feature S_a, and is used to constrain the semantic feature S_a extracted by the semantic encoder so that it contains as little of the timbre feature t_1 as possible.
In step S42, the semantic encoder is constrained based on the output of the timbre class classifier, so that the class corresponding to the timbre features contained in the semantic features output by the semantic encoder differs from the class corresponding to the original timbre features, thereby obtaining the semantic features of the original audio time domain data.
In an embodiment of the present disclosure, adversarial training is performed between the semantic encoder and the timbre class classifier. The timbre class classifier is trained first: the original audio time domain data X_a_t1 and the class I_c of the audio file are input to the untrained timbre class classifier, and the timbre class classifier is optimized by minimizing the loss value calculated by the domain adversarial loss function L_cls until the timbre class classifier converges, so as to obtain a trained timbre class classifier. The semantic encoder is then trained: the network weight parameters of the timbre class classifier are fixed, the original audio time domain data X_a_t1 and the class I_c of the audio file are input to the untrained semantic encoder to obtain semantic features that still contain timbre features, these semantic features are input to the trained timbre class classifier, and the semantic encoder is optimized by maximizing the loss value calculated by the domain adversarial loss function L_cls until the semantic encoder converges, so as to obtain a trained semantic encoder.
In an embodiment of the present disclosure, during the training of the semantic encoder and the timbre class classifier, the loss value is calculated using the domain adversarial loss function L_cls. This loss ensures that the semantic encoder extracts, from the audio time series data, semantic information features that are unrelated to the timbre information features. The domain adversarial loss function L_cls can be written as the classification (cross-entropy) loss of the timbre class classifier C predicting the class I_c of the audio file from the semantic features E(X_a_t1):
L_cls = E[ -log C(I_c | E(X_a_t1)) ].
For the timbre class classifier, the goal is to accurately judge, from the semantic feature S_a, the identity information of the person to whom the audio belongs, which forces the semantic encoder to expose timbre information extracted from the audio time domain data; therefore, the training objective of the timbre class classifier is to minimize the loss function L_cls. The semantic encoder, on the other hand, expects the extracted semantic feature S_a to contain no timbre information features, so the training objective of the semantic encoder is to maximize the loss function L_cls.
In the present disclosure, the original audio time domain data is input to the semantic encoder included in the audio generation network model, and the semantic features output by the semantic encoder are input to the timbre class classifier included in the audio generation network model. The semantic encoder is constrained based on the output of the timbre class classifier, so that the class corresponding to the timbre features contained in the semantic features output by the semantic encoder differs from the class corresponding to the original timbre features, thereby obtaining the semantic features of the original audio time domain data. Through the present disclosure, the adversarial training between the timbre class classifier and the semantic encoder finally enables the semantic encoder to extract the corresponding semantic features from the input audio.
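A simplified sketch of this alternating training is given below, assuming PyTorch and assuming `encoder` maps a waveform to a 128-dim semantic feature while `classifier` maps that feature to speaker-class logits; the actual training uses the WaveNet-based modules described in this disclosure, and the learning rate and step counts are assumptions.

```python
import torch
import torch.nn.functional as F

def train_domain_adversarial(encoder, classifier, loader, steps_cls, steps_enc, lr=1e-4):
    opt_c = torch.optim.Adam(classifier.parameters(), lr=lr)
    opt_e = torch.optim.Adam(encoder.parameters(), lr=lr)

    # Stage 1: train the timbre class classifier by minimizing L_cls.
    for _ in range(steps_cls):
        x, i_c = next(iter(loader))                       # waveform X_a_t1, class I_c
        l_cls = F.cross_entropy(classifier(encoder(x).detach()), i_c)
        opt_c.zero_grad(); l_cls.backward(); opt_c.step()

    # Stage 2: freeze the classifier, train the encoder by maximizing L_cls.
    for p in classifier.parameters():
        p.requires_grad_(False)
    for _ in range(steps_enc):
        x, i_c = next(iter(loader))
        l_cls = F.cross_entropy(classifier(encoder(x)), i_c)
        opt_e.zero_grad()
        (-l_cls).backward()                               # maximize L_cls w.r.t. encoder
        opt_e.step()
    return encoder, classifier
```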
FIG. 5 is a flowchart of pre-training a generator according to an exemplary embodiment. As shown in FIG. 5, the generator is pre-trained in the following manner, including the following steps.
In step S51, first audio time domain data is input to the semantic encoder to obtain first audio semantic features, and the first audio semantic features and the target timbre features are input to a prediction generator to obtain target timbre audio time domain prediction data.
In step S52, the target timbre audio time domain prediction data is input to the semantic encoder to obtain semantic features of the target timbre audio time domain prediction data, and these semantic features and the timbre features of the first audio are input to the prediction generator to obtain second audio time domain prediction data.
In the present disclosure, a true/fake adversarial loss and a timbre feature regression loss are determined through a preset discriminator based on the first audio time domain data and the corresponding timbre features. A true/fake adversarial loss and a timbre feature regression loss are determined through the discriminator based on the target timbre audio time domain prediction data and the target timbre features. A reconstruction loss is determined based on the first audio time domain data and the second audio time domain prediction data. The training of the prediction generator is constrained based on the true/fake adversarial loss, the timbre feature regression loss, and the reconstruction loss, so as to obtain a generator that satisfies the constraint conditions.
In an embodiment of the present disclosure, adversarial training is performed between the generator and the discriminator. The discriminator is trained first. Each training iteration contains two groups of inputs: the first group is the original audio time domain data X_a_t1 and the original timbre feature t_1; the second group is the timbre-converted target timbre audio time domain data X_a_tn, generated by the generator from the input original audio time domain data X_a_t1 and the target timbre feature t_n, together with the target timbre feature t_n. The network parameters of the discriminator are optimized according to the true/fake adversarial loss and the timbre feature regression loss until the discriminator converges, yielding a trained discriminator whose output for each group of inputs is a predicted timbre vector and a true/fake probability value. The generator is then trained. Each training iteration contains three groups of inputs: the first group is the original audio time domain data X_a_t1 and the target timbre feature t_n, and the output is the target timbre audio time domain data X_a_tn, which is input to the discriminator to predict a true/fake probability value and a timbre feature; the second group is the original audio time domain data X_a_t1 and the original timbre feature t_1, and the output is the reconstructed original audio time domain data X̂_a_t1 = G(X_a_t1 | t_1), used to calculate the second term of the reconstruction loss L_rec; the third group is the target timbre audio time domain data X_a_tn and the original timbre feature t_1, and the output is the reconstructed original audio time domain data X̂_a_t1 = G(X_a_tn | t_1), used to calculate the first term of the reconstruction loss L_rec. The loss value is then calculated according to the loss functions, and the network parameters of the generator are optimized and updated through back propagation.
In an embodiment of the present disclosure, during the training of the discriminator and the generator, the timbre feature regression loss function, the true/fake adversarial loss function, and the reconstruction loss function are used as loss functions. The timbre feature regression loss is used to make the timbre attributes of the audio generated by the generator consistent with the given timbre features, and its loss function can be written as
L_t = E[ ||D_t(X_a_t1) − t_1||_2^2 ] + E[ ||D_t(G(X_a_t1 | t_n)) − t_n||_2^2 ],
where D_t(·) denotes the timbre feature predicted by the discriminator. In the first term of L_t, the discriminator predicts the timbre feature of the input original audio time domain data X_a_t1, and an L2 regression loss is applied against the original timbre feature t_1 corresponding to that original audio time series data, thereby improving the discriminator's ability to predict timbre features. In the second term, the discriminator, which now has some timbre prediction ability, predicts the timbre feature of the audio time domain data G(X_a_t1 | t_n) generated by the generator, and an L2 regression loss is applied against the given expected target timbre feature t_n, forcing the timbre feature of the generated audio time domain data to be as consistent as possible with the given target timbre feature.
In an embodiment of the present disclosure, the true/fake adversarial loss function is used to make the distribution of the audio time domain data generated by the generator as consistent as possible with that of real audio time domain data. The loss function is expressed as L_I(G, D, X_a_t1, t_n) = E[log(D(X_a_t1))] + E[log(1 − D(G(X_a_t1 | t_n)))]. For input (real) audio time domain data, the discriminator is expected to output a true/fake prediction value close to 1; for audio time domain data generated by the generator, the discriminator is expected to output a true/fake prediction value close to 0. At the beginning of training, since the generation ability of the generator is weak, the discriminator outputs a probability prediction value close to 0 for the audio time domain data generated by the generator, whereas for audio time domain data generated by a well-trained generator the discriminator is expected to output a true/fake probability value close to 1. In addition, the generator wants the generated audio time domain data to be as similar as possible to real audio time domain data, i.e., the discriminator should output a probability prediction value close to 1 for the audio files generated by the generator. Ultimately, the goal of the generator is to minimize the true/fake adversarial loss function, while the goal of the discriminator is to maximize it.
In an embodiment of the present disclosure, the reconstruction loss function is used to keep the audio time domain data generated by the generator consistent with the input audio time domain data in terms of semantic information while only the timbre features are changed. The loss function can be written as
L_rec = E[ ||G(G(X_a_t1 | t_n) | t_1) − X_a_t1||_2^2 ] + λ_1 · E[ ||G(X_a_t1 | t_1) − X_a_t1||_2^2 ],
where G(G(X_a_t1 | t_n) | t_1) means that the generator first produces the timbre-converted target timbre audio time domain data X_a_tn from the semantic feature S_a of the input original audio time domain data X_a_t1 and the expected target timbre feature t_n; the semantic feature of X_a_tn is then extracted by the encoder and input to the generator together with the original timbre feature t_1, so as to output the audio time domain data X̂_a_t1.
In addition, the generator also reconstructs the input audio time domain data from the semantic feature S_a of the input original audio time domain data X_a_t1 and the corresponding timbre feature t_1. If the generator can keep the semantic information unchanged while changing the timbre attributes, the reconstructed audio time domain data should be very similar to the input audio time domain data. Therefore, the difference between the reconstructed audio time domain data and the input audio time domain data is computed with an L2 regression loss, forcing the generator to keep the semantic features unchanged when generating audio time domain data. Here, λ_1 is a hyperparameter used to weigh the importance of the reconstruction losses at different stages.
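The generator-side loss terms described above can be sketched as follows, assuming PyTorch and assuming the discriminator returns both a real/fake probability and a predicted 128-dim timbre vector; λ_1 and the small epsilon are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def generator_losses(encoder, generator, discriminator, x_a_t1, t_1, t_n, lam_1=1.0):
    s_a = encoder(x_a_t1)
    x_a_tn = generator(s_a, t_n)                        # G(X_a_t1 | t_n)
    prob_fake, t_pred = discriminator(x_a_tn)           # real/fake prob and timbre

    l_t = F.mse_loss(t_pred, t_n)                       # timbre feature regression (L2)
    l_adv = torch.log(1.0 - prob_fake + 1e-8).mean()    # generator minimizes this term

    x_cycle = generator(encoder(x_a_tn), t_1)           # G(G(X_a_t1 | t_n) | t_1)
    x_self  = generator(s_a, t_1)                       # G(X_a_t1 | t_1)
    l_rec = F.mse_loss(x_cycle, x_a_t1) + lam_1 * F.mse_loss(x_self, x_a_t1)
    return l_t, l_adv, l_rec
```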
In an embodiment of the present disclosure, the total loss function of the timbre-conversion audio generation method includes the domain adversarial loss, the timbre feature regression loss, the reconstruction loss, and the true/fake adversarial loss, and the different loss terms are weighted differently. The final loss function L is expressed as
L = λ_cls · L_cls + λ_t · L_t + λ_rec · L_rec + λ_I · L_I,
where λ_cls, λ_t, λ_rec, and λ_I are hyperparameters controlling the relative importance of each loss term. Finally, the training of the whole network can be defined as the minimax problem of a standard generative adversarial network:
G* = arg min_G max_{D,C} L,
where G* denotes the timbre-conversion audio generation network, min_G denotes optimizing the network weight parameters of the generator with the goal of minimizing the loss value of the loss function L, and max_{D,C} denotes optimizing the network weight parameters of the discriminator and the timbre class classifier with the goal of maximizing the loss value of the loss function L.
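Putting the terms together, one alternating update of the minimax problem could look roughly like the following sketch; the λ weights, the optimizers, and the split of the total loss into a generator-side value L_g and a discriminator/classifier-side value L_dc are assumptions for illustration.

```python
def total_loss(l_cls, l_t, l_rec, l_adv,
               lam_cls=1.0, lam_t=1.0, lam_rec=10.0, lam_i=1.0):
    """L = lam_cls*L_cls + lam_t*L_t + lam_rec*L_rec + lam_i*L_I (weights assumed)."""
    return lam_cls * l_cls + lam_t * l_t + lam_rec * l_rec + lam_i * l_adv

def minimax_step(L_g, L_dc, opt_generator, opt_disc_cls):
    # Generator side: minimize L.
    opt_generator.zero_grad(); L_g.backward(); opt_generator.step()
    # Discriminator and timbre class classifier side: maximize L.
    opt_disc_cls.zero_grad(); (-L_dc).backward(); opt_disc_cls.step()
```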
In the present disclosure, the original semantic training features and the original audio training features of the audio time domain training data are input to the prediction generator to obtain first audio time domain prediction data. The original semantic training features and the target audio training features of the audio time domain training data are input to the prediction generator to obtain second audio time domain prediction data. The predicted semantic training features of the second audio time domain prediction data and the original audio training features are input to the prediction generator to obtain third audio time domain prediction data. Through the present disclosure, the generator can effectively combine semantic features with timbre features and generate timbre-converted audio time domain data whose timbre features are consistent with the given target timbre features and whose semantic features are consistent with those of the audio time domain data input to the semantic encoder.
FIG. 6 is a schematic diagram of timbre-converted audio generation. As shown in FIG. 6, the timbre feature of the target speaker is obtained through the trained WaveNet classification model, and this feature is input, together with the original audio time domain data, into the timbre-conversion audio generation network, so as to generate the timbre-converted audio time domain data. Based on the trained model, a single audio file can be converted into audio files of multiple timbres without retraining the network model for different conversion scenarios, so the model has strong generalization performance.
It should be noted that those skilled in the art can understand that the various implementations/embodiments described above in the embodiments of the present disclosure may be used in combination with the foregoing embodiments or used independently; whether used alone or together, the implementation principles are similar. In the present disclosure, some embodiments are described as being used together. Of course, those skilled in the art can understand that such illustration does not limit the embodiments of the present disclosure.
Based on the same concept, an embodiment of the present disclosure further provides an apparatus for generating audio.
It can be understood that, in order to implement the above functions, the apparatus for generating audio provided by the embodiments of the present disclosure includes corresponding hardware structures and/or software modules for performing the respective functions. In combination with the units and algorithm steps of the examples disclosed in the embodiments of the present disclosure, the embodiments of the present disclosure can be implemented in the form of hardware or a combination of hardware and computer software. Whether a certain function is performed by hardware or by computer software driving hardware depends on the specific application and design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each specific application, but such implementations should not be considered to be beyond the scope of the technical solutions of the embodiments of the present disclosure.
FIG. 7 is a block diagram of an apparatus for generating audio according to an exemplary embodiment. Referring to FIG. 7, the apparatus 100 includes an acquiring unit 101, an extracting unit 102, and a generating unit 103.
The acquiring unit 101 is configured to acquire original audio time domain data. The extracting unit 102 is configured to extract timbre features of the original audio time domain data to obtain original timbre features. The generating unit 103 is configured to generate target timbre audio time domain data based on the original audio time domain data, the original timbre features, and target timbre features, where the semantic features in the target timbre audio time domain data match the semantic features of the original audio time domain data, and the timbre features in the target timbre audio time domain data match the target timbre features.
In an implementation, the generating unit 103 generates the target timbre audio time domain data based on the original audio time domain data, the original timbre features, and the target timbre features in the following manner: generating the target timbre audio time domain data based on the original audio time domain data, the original timbre features, and the target timbre features, as well as a pre-trained audio generation network model; the audio generation network model is used to perform timbre conversion on audio time domain data to generate timbre-converted audio time domain data.
In an implementation, the generating unit 103 generates the target timbre audio time domain data based on the original audio time domain data, the original timbre features, the target timbre features, and the pre-trained audio generation network model in the following manner: obtaining semantic features of the original audio time domain data based on the original audio time domain data and the semantic encoder included in the audio generation network model; and generating the target timbre audio time domain data based on the semantic features, the original timbre features, the target timbre features, and the generator included in the audio generation model; the generator is used to generate audio time domain data based on semantic features and timbre features.
In an implementation, the generating unit 103 obtains the semantic features of the original audio time domain data based on the original audio time domain data and the semantic encoder included in the audio generation network model in the following manner: inputting the original audio time domain data to the semantic encoder included in the audio generation network model, and inputting the semantic features output by the semantic encoder to the timbre class classifier included in the audio generation network model, where the timbre class classifier is used to identify the timbre class of the input semantic features; and constraining the semantic encoder based on the output of the timbre class classifier, so that the semantic features output by the semantic encoder do not contain any timbre features, thereby obtaining the semantic features of the original audio time domain data.
In an implementation, the generator is pre-trained in the following manner: inputting first audio time domain data to the semantic encoder to obtain first audio semantic features, and inputting the first audio semantic features and the target timbre features to a prediction generator to obtain target timbre audio time domain prediction data; inputting the target timbre audio time domain prediction data to the semantic encoder to obtain semantic features of the target timbre audio time domain prediction data, and inputting these semantic features and the timbre features of the first audio to the prediction generator to obtain second audio time domain prediction data; determining a true/fake adversarial loss and a timbre feature regression loss through a preset discriminator based on the first audio time domain data and the corresponding timbre features; determining a true/fake adversarial loss and a timbre feature regression loss through the discriminator based on the target timbre audio time domain prediction data and the target timbre features; determining a reconstruction loss based on the first audio time domain data and the second audio time domain prediction data; and constraining the training of the prediction generator based on the true/fake adversarial loss, the timbre feature regression loss, and the reconstruction loss, so as to obtain a generator that satisfies the constraint conditions.
With regard to the apparatus in the above embodiments, the specific manner in which each module performs operations has been described in detail in the embodiments relating to the method, and will not be elaborated here.
FIG. 8 is a block diagram of an apparatus for generating audio according to an exemplary embodiment. For example, the apparatus 200 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like.
Referring to FIG. 8, the apparatus 200 may include one or more of the following components: a processing component 202, a memory 204, a power component 206, a multimedia component 208, an audio component 210, an input/output (I/O) interface 212, a sensor component 214, and a communication component 216.
The processing component 202 generally controls the overall operations of the apparatus 200, such as operations associated with display, telephone calls, data communication, camera operations, and recording operations. The processing component 202 may include one or more processors 220 to execute instructions so as to complete all or part of the steps of the above method. In addition, the processing component 202 may include one or more modules to facilitate interaction between the processing component 202 and other components. For example, the processing component 202 may include a multimedia module to facilitate interaction between the multimedia component 208 and the processing component 202.
The memory 204 is configured to store various types of data to support operations at the apparatus 200. Examples of such data include instructions for any application or method operating on the apparatus 200, contact data, phonebook data, messages, pictures, videos, and so on. The memory 204 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.
The power component 206 provides power to the various components of the apparatus 200. The power component 206 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 200.
The multimedia component 208 includes a screen providing an output interface between the apparatus 200 and a user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or swipe action, but also detect the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 208 includes a front camera and/or a rear camera. When the apparatus 200 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each of the front camera and the rear camera may be a fixed optical lens system or have focal length and optical zoom capability.
The audio component 210 is configured to output and/or input audio signals. For example, the audio component 210 includes a microphone (MIC). When the apparatus 200 is in an operation mode, such as a call mode, a recording mode, or a voice recognition mode, the microphone is configured to receive external audio signals. The received audio signals may be further stored in the memory 204 or transmitted via the communication component 216. In some embodiments, the audio component 210 further includes a speaker for outputting audio signals.
The I/O interface 212 provides an interface between the processing component 202 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, or the like. These buttons may include, but are not limited to, a home button, a volume button, a start button, and a lock button.
The sensor component 214 includes one or more sensors for providing status assessments of various aspects of the apparatus 200. For example, the sensor component 214 may detect the open/closed state of the apparatus 200 and the relative positioning of components, such as the display and keypad of the apparatus 200. The sensor component 214 may also detect a change in position of the apparatus 200 or of a component of the apparatus 200, the presence or absence of contact between the user and the apparatus 200, the orientation or acceleration/deceleration of the apparatus 200, and a change in temperature of the apparatus 200. The sensor component 214 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 214 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 214 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 216 is configured to facilitate wired or wireless communication between the apparatus 200 and other devices. The apparatus 200 may access a wireless network based on a communication standard, such as WiFi, 4G, or 5G, or a combination thereof. In an exemplary embodiment, the communication component 216 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 216 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 200 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for performing the above method.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, such as the memory 204 including instructions, and the instructions can be executed by the processor 220 of the apparatus 200 to complete the above method. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
It can be understood that in the present disclosure, "a plurality of" refers to two or more, and other quantifiers are similar. "And/or" describes the association relationship of associated objects and indicates that three relationships may exist; for example, A and/or B may indicate three cases: A exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates an "or" relationship between the associated objects before and after it. The singular forms "a", "the", and "said" are also intended to include the plural forms, unless the context clearly indicates otherwise.
It can be further understood that the terms "first", "second", and the like are used to describe various pieces of information, but such information should not be limited by these terms. These terms are only used to distinguish information of the same type from each other and do not indicate a particular order or degree of importance. In fact, expressions such as "first" and "second" can be used interchangeably. For example, without departing from the scope of the present disclosure, first information may also be referred to as second information, and similarly, second information may also be referred to as first information.
It can be further understood that, unless otherwise specified, "connected" includes a direct connection between two elements without other members in between, and also an indirect connection between the two with other elements in between.
It can be further understood that although the operations in the embodiments of the present disclosure are described in a specific order in the drawings, this should not be understood as requiring that these operations be performed in the specific order shown or in serial order, or that all of the operations shown be performed to obtain the desired result. In certain circumstances, multitasking and parallel processing may be advantageous.
Those skilled in the art will easily think of other embodiments of the present disclosure after considering the specification and practicing the invention disclosed herein. The present application is intended to cover any variations, uses, or adaptations of the present disclosure that follow the general principles of the present disclosure and include common knowledge or customary technical means in the technical field not disclosed by the present disclosure. The specification and embodiments are to be regarded as exemplary only, and the true scope and spirit of the present disclosure are indicated by the following claims.
It should be understood that the present disclosure is not limited to the precise structures that have been described above and shown in the drawings, and various modifications and changes can be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.

Claims (12)

  1. A method for generating audio, characterized in that the method comprises:
    acquiring original audio time domain data;
    extracting timbre features of the original audio time domain data to obtain original timbre features;
    generating target timbre audio time domain data based on the original audio time domain data, the original timbre features, and target timbre features, wherein semantic features in the target timbre audio time domain data match semantic features of the original audio time domain data, and timbre features in the target timbre audio time domain data match the target timbre features.
  2. The method according to claim 1, characterized in that generating the target timbre audio time domain data based on the original audio time domain data, the original timbre features, and the target timbre features comprises:
    generating the target timbre audio time domain data based on the original audio time domain data, the original timbre features, and the target timbre features, as well as a pre-trained audio generation network model;
    wherein the audio generation network model is used to perform timbre conversion on audio time domain data to generate timbre-converted audio time domain data.
  3. The method according to claim 2, characterized in that generating the target timbre audio time domain data based on the original audio time domain data, the original timbre features, the target timbre features, and the pre-trained audio generation network model comprises:
    obtaining semantic features of the original audio time domain data based on the original audio time domain data and a semantic encoder included in the audio generation network model;
    generating the target timbre audio time domain data based on the semantic features, the original timbre features, the target timbre features, and a generator included in the audio generation network model;
    wherein the generator is used to generate audio time domain data based on semantic features and timbre features.
  4. The method according to claim 3, characterized in that obtaining the semantic features of the original audio time domain data based on the original audio time domain data and the semantic encoder included in the audio generation network model comprises:
    inputting the original audio time domain data to the semantic encoder included in the audio generation network model, and inputting the semantic features output by the semantic encoder to a timbre class classifier included in the audio generation network model;
    wherein the timbre class classifier is used to identify the timbre class of the input semantic features;
    constraining the semantic encoder based on an output of the timbre class classifier, so that the semantic features output by the semantic encoder do not contain any timbre features, to obtain the semantic features of the original audio time domain data.
  5. The method according to claim 3 or 4, characterized in that the generator is pre-trained in the following manner:
    inputting first audio time domain data to the semantic encoder to obtain first audio semantic features, and inputting the first audio semantic features and target timbre features to a prediction generator to obtain target timbre audio time domain prediction data;
    inputting the target timbre audio time domain prediction data to the semantic encoder to obtain semantic features of the target timbre audio time domain prediction data, and inputting the semantic features and timbre features of the first audio to the prediction generator to obtain second audio time domain prediction data;
    determining a true/fake adversarial loss and a timbre feature regression loss through a preset discriminator based on the first audio time domain data and the corresponding timbre features;
    determining a true/fake adversarial loss and a timbre feature regression loss through the discriminator based on the target timbre audio time domain prediction data and the target timbre features;
    determining a reconstruction loss based on the first audio time domain data and the second audio time domain prediction data;
    constraining training of the prediction generator based on the true/fake adversarial loss, the timbre feature regression loss, and the reconstruction loss, to obtain a generator that satisfies the constraint conditions.
  6. An apparatus for generating audio, characterized by comprising:
    an acquiring unit configured to acquire original audio time domain data;
    an extracting unit configured to extract timbre features of the original audio time domain data to obtain original timbre features;
    a generating unit configured to generate target timbre audio time domain data based on the original audio time domain data, the original timbre features, and target timbre features, wherein semantic features in the target timbre audio time domain data match semantic features of the original audio time domain data, and timbre features in the target timbre audio time domain data match the target timbre features.
  7. The apparatus according to claim 6, characterized in that the generating unit generates the target timbre audio time domain data based on the original audio time domain data, the original timbre features, and the target timbre features in the following manner:
    generating the target timbre audio time domain data based on the original audio time domain data, the original timbre features, and the target timbre features, as well as a pre-trained audio generation network model;
    wherein the audio generation network model is used to perform timbre conversion on audio time domain data to generate timbre-converted audio time domain data.
  8. The apparatus according to claim 7, characterized in that the generating unit generates the target timbre audio time domain data based on the original audio time domain data, the original timbre features, the target timbre features, and the pre-trained audio generation network model in the following manner:
    obtaining semantic features of the original audio time domain data based on the original audio time domain data and a semantic encoder included in the audio generation network model;
    generating the target timbre audio time domain data based on the semantic features, the original timbre features, the target timbre features, and a generator included in the audio generation network model;
    wherein the generator is used to generate audio time domain data based on semantic features and timbre features.
  9. The apparatus according to claim 8, characterized in that the generating unit obtains the semantic features of the original audio time domain data based on the original audio time domain data and the semantic encoder included in the audio generation network model in the following manner:
    inputting the original audio time domain data to the semantic encoder included in the audio generation network model, and inputting the semantic features output by the semantic encoder to a timbre class classifier included in the audio generation network model;
    wherein the timbre class classifier is used to identify the timbre class of the input semantic features;
    constraining the semantic encoder based on an output of the timbre class classifier, so that the semantic features output by the semantic encoder do not contain any timbre features, to obtain the semantic features of the original audio time domain data.
  10. The apparatus according to claim 8 or 9, characterized in that the generator is pre-trained in the following manner:
    inputting first audio time domain data to the semantic encoder to obtain first audio semantic features, and inputting the first audio semantic features and target timbre features to a prediction generator to obtain target timbre audio time domain prediction data;
    inputting the target timbre audio time domain prediction data to the semantic encoder to obtain semantic features of the target timbre audio time domain prediction data, and inputting the semantic features and timbre features of the first audio to the prediction generator to obtain second audio time domain prediction data;
    determining a true/fake adversarial loss and a timbre feature regression loss through a preset discriminator based on the first audio time domain data and the corresponding timbre features;
    determining a true/fake adversarial loss and a timbre feature regression loss through the discriminator based on the target timbre audio time domain prediction data and the target timbre features; determining a reconstruction loss based on the first audio time domain data and the second audio time domain prediction data;
    constraining training of the prediction generator based on the true/fake adversarial loss, the timbre feature regression loss, and the reconstruction loss, to obtain a generator that satisfies the constraint conditions.
  11. An apparatus for generating audio, characterized by comprising:
    a processor;
    a memory for storing instructions executable by the processor;
    wherein the processor is configured to perform the method according to any one of claims 1 to 5.
  12. A storage medium, characterized in that instructions are stored in the storage medium, and when the instructions in the storage medium are executed by a processor, the processor is enabled to perform the method according to any one of claims 1 to 5.




