CN117059106A - Sound effect audio generation method and device for audio book and readable storage medium - Google Patents

Sound effect audio generation method and device for audio book and readable storage medium

Info

Publication number
CN117059106A
Authority
CN
China
Prior art keywords
audio
text
noise
representation
sound effect
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311157409.XA
Other languages
Chinese (zh)
Inventor
庄晓滨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd filed Critical Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202311157409.XA priority Critical patent/CN117059106A/en
Publication of CN117059106A publication Critical patent/CN117059106A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 — Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/0018 — Speech coding using phonetic or linguistical decoding of the source; Reconstruction using text-to-speech synthesis
    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/06 — Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 — Training
    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 — Speaker identification or verification techniques
    • G10L17/04 — Training, enrolment or model building
    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 — Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/005 — Correction of errors induced by the transmission channel, if related to the coding algorithm

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application discloses a sound effect audio generation method, a sound effect audio generation device and a computer-readable storage medium for an audio book. The method of the embodiment of the application comprises the following steps: obtaining a target text of an audio book; inputting the target text into a pre-trained text encoder, which extracts the scene content of the target text to obtain a text representation output by the text encoder; inputting the text representation into a pre-trained diffusion model, whose N sub-diffusion models sequentially denoise the text representation to obtain the N-times-denoised text representation output by the diffusion model, wherein N is an integer greater than or equal to 2; and inputting the denoised text representation into an audio decoder to obtain the target sound effect audio, output by the audio decoder, that corresponds to the scene of the target text.

Description

Sound effect audio generation method and device for audio book and readable storage medium
Technical Field
The embodiment of the application relates to the field of sound effect audio generation for audio books, and in particular to a sound effect audio generation method, a sound effect audio generation device and a computer-readable storage medium for an audio book.
Background
With the rapid development of information and digital technology, digital reading has become one of the most important reading modes for readers in China, and audio books and their carriers, closely related to digital reading, have developed accordingly. Most traditional audio books are recorded by human narrators, so productivity is limited. In recent years, highly expressive speech synthesis technology has matured and has high application value in audio book production, but current audio books still have a single content form, consisting basically of only narration or background music. To enhance the vividness of audio books and bring users a better listening experience, sound effect audio needs to be generated for them.
Existing sound effect audio generation methods for audio books generally select suitable sound effect material from a sound effect material library, manually or semi-automatically, and then add the selected material to the audio book. For example, patent CN 115428469A describes a method of selecting suitable audio material from a material library: it generates audio recommendations for visual input by training a machine learning model that learns coarse-grained and fine-grained audio-visual correlations from reference visual signals and positive and negative audio signals. The trained acoustic recommendation network is configured to output audio embeddings and visual embeddings, use them to calculate correlation distances between an image frame or video clip and one or more audio clips retrieved from a database, rank those correlation distances, determine the audio clip with the closest correlation distance, and apply that audio clip to the input image frame or video clip.
This is a matching approach: the similarity between the model input (an image frame, video clip or text) and one or more audio clips in the database is estimated by calculating correlation distances. However, the estimation algorithm is complex, cumbersome and prone to calculation errors, so the similarity estimate has low accuracy, the matched sound effect audio has low accuracy, and the sound effect audio generated for the audio book is therefore inaccurate.
Disclosure of Invention
The embodiment of the application provides a sound effect audio generation method, a sound effect audio generation device and a computer-readable storage medium for an audio book, which can generate sound effect audio for an audio book while improving the accuracy of that generation.
In a first aspect, an embodiment of the present application provides a sound effect audio generation method for an audio book, including:
obtaining a target text of an audio book;
inputting the target text into a pre-trained text encoder, which extracts the scene content of the target text to obtain a text representation output by the text encoder;
inputting the text representation into a pre-trained diffusion model, whose N sub-diffusion models sequentially denoise the text representation to obtain the N-times-denoised text representation output by the diffusion model, wherein N is an integer greater than or equal to 2;
and inputting the denoised text representation into an audio decoder to obtain the target sound effect audio, output by the audio decoder, that corresponds to the scene of the target text.
Optionally, before the target text is input into the pre-trained text encoder, the method includes:
obtaining a text sample, wherein the text sample is annotated with a scene sound effect representation;
inputting the text sample into a text encoder to obtain a predicted text representation output by the text encoder;
and calculating, according to a regression loss function, the loss between the predicted text representation corresponding to each text sample and the annotated scene sound effect representation, and obtaining the trained text encoder when the loss satisfies a convergence condition.
Optionally, before the text sample annotated with the scene sound effect representation is obtained, the method further includes:
constructing a database, wherein the database comprises text samples and the text samples are annotated with audio samples; the text samples include at least one sound effect text and/or at least one non-sound-effect text; the audio samples include at least one sound effect audio and/or at least one mute audio; each sound effect text is annotated with its corresponding sound effect audio, and each non-sound-effect text is annotated with its corresponding mute audio;
the obtaining of a text sample annotated with a scene sound effect representation includes:
inputting the audio sample into an audio encoder, which extracts the scene content of the at least one sound effect audio and/or at least one mute audio of the audio sample to obtain the audio representation corresponding to each sound effect audio and/or each mute audio output by the audio encoder; these audio representations are the annotated scene sound effect representations;
the inputting of the text sample into a text encoder to obtain a predicted text representation output by the text encoder includes:
inputting the text sample into the text encoder, which extracts the scene content of the at least one sound effect text and/or at least one non-sound-effect text of the text sample to obtain the predicted text representation corresponding to each sound effect text and/or each non-sound-effect text output by the text encoder;
the calculating of the loss between the predicted text representation corresponding to each text sample and the annotated scene sound effect representation according to the regression loss function, and obtaining the trained text encoder when the loss satisfies the convergence condition, includes:
calculating, according to the regression loss function, a first loss between the predicted text representation corresponding to each sound effect text and the audio representation corresponding to each sound effect audio, and/or a second loss between the predicted text representation of each non-sound-effect text and the audio representation corresponding to each mute audio, and obtaining the trained text encoder when the first loss and/or the second loss satisfy the convergence condition.
Optionally, the calculating, according to the regression loss function, of the first loss between the predicted text representation corresponding to each sound effect text and the audio representation corresponding to each sound effect audio, and/or the second loss between the predicted text representation of each non-sound-effect text and the audio representation corresponding to each mute audio, and the obtaining of the trained text encoder when the first loss and/or the second loss satisfy the convergence condition, includes:
for each sound effect text, calculating, according to the regression loss function, a third loss between the predicted text representation corresponding to the sound effect text and the audio representation corresponding to the sound effect audio annotated for that sound effect text, and a fourth loss between the predicted text representation corresponding to the sound effect text and the audio representations corresponding to the sound effect audios other than the one annotated for that sound effect text; and/or
for each non-sound-effect text, calculating, according to the regression loss function, a fifth loss between the predicted text representation corresponding to the non-sound-effect text and the audio representation corresponding to the mute audio annotated for that non-sound-effect text, and a sixth loss between the predicted text representation corresponding to the non-sound-effect text and the audio representations corresponding to the sound effect audios other than the mute audio annotated for that non-sound-effect text;
and determining, when the third loss and/or the fifth loss satisfy a first convergence condition and the fourth loss and/or the sixth loss satisfy a second convergence condition, that the first loss and/or the second loss satisfy the convergence condition, so as to obtain the trained text encoder.
Optionally, before the audio sample is input into the audio encoder, the method further includes:
performing uniform-sampling preprocessing at a preset sampling rate and/or volume equalization preprocessing on the at least one sound effect audio and/or at least one mute audio of the audio sample to obtain at least one preprocessed sound effect audio and/or at least one preprocessed mute audio, so as to obtain a preprocessed audio sample;
the inputting of the audio sample into an audio encoder includes:
inputting the preprocessed audio sample into the audio encoder.
Optionally, each of the N sub-diffusion models includes a corresponding noise-adding module, noise estimation model and denoising module;
the method further comprises:
obtaining an audio representation sample, wherein the audio representation sample is annotated with a text representation and with the real noise of N time steps;
inputting the audio representation sample into a diffusion model, wherein the noise-adding modules corresponding to the N sub-diffusion models of the diffusion model sequentially add noise to the audio representation sample to obtain a noise-added audio representation, the noise estimation models corresponding to the N sub-diffusion models sequentially estimate the noise of the noise-added audio representation according to the relation between the noise-added audio representation sample and the annotated text representation to obtain predicted noise values, and the denoising modules corresponding to the N sub-diffusion models sequentially denoise the noise-added audio representation according to the predicted noise values to obtain the predicted audio representation after N denoising operations;
and sequentially calculating, according to a regression loss function, the losses between the predicted noise values corresponding to the N sub-diffusion models and the annotated real noise of the N time steps, and, when the losses corresponding to the N sub-diffusion models satisfy the convergence conditions, obtaining the trained noise estimation models corresponding to the N sub-diffusion models, so as to obtain the trained diffusion model from the trained noise estimation models.
Optionally, the inputting of the audio representation sample into the diffusion model, with the noise-adding modules of the N sub-diffusion models sequentially adding noise to the audio representation sample to obtain a noise-added audio representation, the noise estimation models of the N sub-diffusion models sequentially estimating the noise of the noise-added audio representation according to the relation between the noise-added audio representation sample and the annotated text representation to obtain predicted noise values, and the denoising modules of the N sub-diffusion models sequentially denoising the noise-added audio representation according to the predicted noise values to obtain the denoised predicted audio representation, includes:
inputting the audio representation sample into the 1st time-step sub-diffusion model, whose noise-adding module adds noise to the audio representation sample to obtain a noise-added audio representation, whose noise estimation model estimates the noise of the noise-added audio representation according to the relation between the noise-added audio representation and the annotated text representation to obtain a predicted noise value, and whose denoising module denoises the noise-added audio representation according to the predicted noise value to output the 1st time-step predicted audio representation;
for the t-th time-step sub-diffusion model, inputting the (t-1)-th time-step predicted audio representation output by the (t-1)-th time-step sub-diffusion model into the t-th time-step sub-diffusion model, whose noise-adding module adds noise to the (t-1)-th time-step predicted audio representation to obtain a noise-added audio representation, whose noise estimation model estimates the noise of the noise-added audio representation according to the relation between the noise-added audio representation and the annotated text representation to obtain a predicted noise value, and which denoises the noise-added audio representation according to the predicted noise value to output the t-th time-step predicted audio representation; the predicted audio representation of the final (N-th) time step is the predicted audio representation after the N denoising operations, and 2 ≤ t ≤ N.
Optionally, the sequential calculation, according to the regression loss function, of the losses between the predicted noise values corresponding to the N sub-diffusion models and the annotated real noise of the N time steps, and the obtaining, when the losses corresponding to the N sub-diffusion models satisfy the convergence conditions, of the trained noise estimation models corresponding to the N sub-diffusion models so as to obtain the trained diffusion model from them, includes:
for the noise estimation model of the 1st time-step sub-diffusion model, calculating, according to the regression loss function, a 1st loss between the predicted noise value corresponding to that noise estimation model and the annotated real noise of the 1st time step, and obtaining the trained noise estimation model of the 1st time-step sub-diffusion model when the 1st loss satisfies the convergence condition;
for the noise estimation model of the t-th time-step sub-diffusion model, calculating, according to the regression loss function, a t-th loss between the predicted noise value corresponding to that noise estimation model and the annotated real noise of the t-th time step, and obtaining the trained noise estimation model of the t-th time-step sub-diffusion model when the t-th loss satisfies the convergence condition;
and obtaining the trained diffusion model from the trained noise estimation model of the 1st time-step sub-diffusion model and/or the trained noise estimation model of the t-th time-step sub-diffusion model.
In a second aspect, an embodiment of the present application provides a sound effect audio generation device, including:
an obtaining unit configured to obtain a target text of an audio book;
an extraction unit configured to input the target text into a pre-trained text encoder, which extracts the scene content of the target text to obtain a text representation output by the text encoder;
a denoising unit configured to input the text representation into a pre-trained diffusion model, whose N sub-diffusion models sequentially denoise the text representation to obtain the N-times-denoised text representation output by the diffusion model, wherein N is an integer greater than or equal to 2;
and a decoding unit configured to input the denoised text representation into an audio decoder to obtain the target sound effect audio, output by the audio decoder, that corresponds to the scene of the target text.
In a third aspect, an embodiment of the present application provides a sound effect audio generation device, including:
a central processing unit, a memory, an input/output interface, a wired or wireless network interface, and a power supply;
the memory is a short-term memory or a persistent memory;
the central processing unit is configured to communicate with the memory and execute the instructions in the memory so as to perform the aforementioned sound effect audio generation method for an audio book.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the aforementioned sound effect audio generation method for an audio book.
In a fifth aspect, an embodiment of the present application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the aforementioned sound effect audio generation method for an audio book.
From the above technical solutions, the embodiment of the present application has the following advantages: the target text of the audio book can be input into a pre-trained text encoder and the scene content of the target text extracted to obtain a text representation; the text representation can be input into a pre-trained diffusion model and denoised N times; and the N-times-denoised text representation can be input into an audio decoder to obtain the target sound effect audio, output by the audio decoder, that corresponds to the scene of the target text. This is a generative approach that replaces the matching approach: the text representation is extracted and denoised, so the resulting text representation has higher accuracy, the target sound effect audio is generated from this more accurate representation, and the mismatches caused by the low accuracy of the matching approach's similarity estimation are avoided.
Drawings
Fig. 1 is a schematic architecture diagram of a sound effect audio generation system for audio books according to an embodiment of the present application;
Fig. 2 is a schematic flow chart of a sound effect audio generation method for an audio book according to an embodiment of the present application;
Fig. 3 is a schematic flow chart of a method for training a text encoder according to an embodiment of the present application;
Fig. 4 is a schematic flow chart of a method for training a diffusion model according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a sound effect audio generation device according to an embodiment of the present application;
Fig. 6 is a schematic structural diagram of another sound effect audio generation device according to an embodiment of the present application.
Detailed Description
The embodiment of the application provides a sound effect audio generation method for an audio book, a sound effect audio generation device and a computer-readable storage medium, which are used to generate sound effect audio for an audio book while improving the accuracy of that generation.
Referring to fig. 1, the architecture of a sound effect audio generation system for audio books according to an embodiment of the present application includes:
a sound effect audio generation device 101 and a client 102. When sound effect audio is generated for an audio book, the sound effect audio generation device 101 may be connected to the client 102. The client 102 may send the device 101 a sound effect audio generation request for a target text of the audio book; the device 101 may input the target text into a pre-trained text encoder to obtain a text representation, input the text representation into a pre-trained diffusion model to obtain the N-times-denoised text representation, and input the denoised text representation into an audio decoder to obtain the target sound effect audio, which the device 101 may then send to the client 102.
Referring to fig. 2, fig. 2 is a schematic flow chart of a sound effect audio generation method for an audio book according to an embodiment of the present application, the method including:
201. Obtain the target text of the audio book.
In this embodiment, when sound effect audio is generated for the audio book, the target text of the audio book can be obtained.
202. Input the target text into a pre-trained text encoder, which extracts the scene content of the target text to obtain a text representation output by the text encoder.
After the target text of the audio book is obtained, it can be input into the pre-trained text encoder, which extracts the scene content of the target text to obtain the text representation output by the text encoder.
Before the target text is input into the pre-trained text encoder, the text encoder can be trained. Specifically, the text encoder may be trained as follows: first obtain a text sample annotated with a scene sound effect representation, then input the text sample into the text encoder to obtain a predicted text representation output by the text encoder, then calculate, according to a regression loss function, the loss between the predicted text representation corresponding to each text sample and the annotated scene sound effect representation, and obtain the trained text encoder when the loss satisfies a convergence condition. To obtain a text sample annotated with a scene sound effect representation, a database is first constructed that contains text samples annotated with audio samples, where the text samples include at least one sound effect text and/or at least one non-sound-effect text, the audio samples include at least one sound effect audio and/or at least one mute audio, each sound effect text is annotated with its corresponding sound effect audio, and each non-sound-effect text is annotated with its corresponding mute audio. The audio sample is then input into an audio encoder, which extracts the scene content of the at least one sound effect audio and/or at least one mute audio to obtain the audio representation corresponding to each sound effect audio and/or each mute audio; these audio representations are the annotated scene sound effect representations. The predicted text representation output by the text encoder is obtained by having the text encoder extract the scene content of the at least one sound effect text and/or at least one non-sound-effect text of the text sample, yielding the predicted text representation corresponding to each sound effect text and/or each non-sound-effect text. The trained text encoder may then be obtained by calculating, according to the regression loss function, a first loss between the predicted text representation corresponding to each sound effect text and the audio representation corresponding to each sound effect audio, and/or a second loss between the predicted text representation of each non-sound-effect text and the audio representation corresponding to each mute audio, and stopping when the first loss and/or the second loss satisfy the convergence condition. It will be appreciated that other reasonable methods of training the text encoder are also possible, and this is not limited here.
203. Input the text representation into a pre-trained diffusion model, whose N sub-diffusion models sequentially denoise the text representation to obtain the N-times-denoised text representation output by the diffusion model, wherein N is an integer greater than or equal to 2.
After the target text has been input into the pre-trained text encoder and its scene content extracted to obtain the text representation output by the text encoder, the text representation can be input into the pre-trained diffusion model, whose N sub-diffusion models sequentially denoise it to obtain the N-times-denoised text representation output by the diffusion model, where N is an integer greater than or equal to 2.
The diffusion model may be trained before the text representation is input into it. Specifically, the diffusion model may be trained as follows: first obtain an audio representation sample annotated with a text representation and with the real noise of N time steps; then input the audio representation sample into the diffusion model, where the noise-adding modules of the N sub-diffusion models sequentially add noise to the audio representation sample to obtain a noise-added audio representation, the noise estimation models of the N sub-diffusion models sequentially estimate the noise of the noise-added audio representation according to the relation between the noise-added audio representation sample and the annotated text representation to obtain predicted noise values, and the denoising modules of the N sub-diffusion models sequentially denoise the noise-added audio representation according to the predicted noise values to obtain the N-times-denoised predicted audio representation; finally, calculate, according to a regression loss function, the losses between the predicted noise values corresponding to the N sub-diffusion models and the annotated real noise of the N time steps, and, when those losses satisfy the convergence conditions, obtain the trained noise estimation models of the N sub-diffusion models and, from them, the trained diffusion model. Other reasonable methods of training the diffusion model are also possible, and this is not limited here.
204. Input the denoised text representation into an audio decoder to obtain the target sound effect audio, output by the audio decoder, that corresponds to the scene of the target text.
After the N-times-denoised text representation output by the diffusion model has been obtained, it can be input into the audio decoder to obtain the target sound effect audio, output by the audio decoder, that corresponds to the scene of the target text.
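The four steps above can be summarised in a minimal sketch, assuming PyTorch-style callables named text_encoder, sub_diffusion_models and audio_decoder; these names and interfaces are illustrative assumptions, not the implementation of the embodiment itself.

```python
import torch

@torch.no_grad()
def generate_sound_effect(target_text, text_encoder, sub_diffusion_models, audio_decoder):
    # Step 202: extract the scene content of the target text as a text representation.
    text_repr = text_encoder(target_text)

    # Step 203: the N sub-diffusion models (N >= 2) denoise sequentially,
    # conditioned on the text representation (cf. the reverse process described later).
    repr_t = torch.randn_like(text_repr)
    for sub_model in sub_diffusion_models:
        repr_t = sub_model(repr_t, condition=text_repr)

    # Step 204: decode the denoised representation into the target sound-effect audio.
    return audio_decoder(repr_t)
```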
According to the embodiment of the application, the target text of the audio book can be input into the pre-trained text encoder, the scene content of the target text extracted to obtain a text representation, the text representation input into the pre-trained diffusion model and denoised N times, and the N-times-denoised text representation input into the audio decoder to obtain the target sound effect audio corresponding to the scene of the target text. This is a generative approach that replaces the matching approach: the text representation is extracted and denoised, so the resulting text representation has higher accuracy, the target sound effect audio is generated from this more accurate representation, and the mismatches caused by the low accuracy of the matching approach's similarity estimation are avoided.
In the embodiment of the present application, the text encoder may be trained before the target text is input into it, and the diffusion model may be trained before the text representation is input into it. There are many possible ways of training the text encoder and the diffusion model; one of them is described below on the basis of the sound effect audio generation method for audio books shown in fig. 1.
In this embodiment, when sound effect audio is generated for the audio book, the target text of the audio book can be obtained.
After the target text of the audio book is obtained, it can be input into the pre-trained text encoder, which extracts the scene content of the target text to obtain the text representation output by the text encoder.
The text encoder may be trained before the target text is input into it.
Specifically, the text encoder may be trained as follows: first obtain a text sample annotated with a scene sound effect representation, then input the text sample into the text encoder to obtain a predicted text representation output by the text encoder, and finally calculate, according to a regression loss function, the loss between the predicted text representation corresponding to each text sample and the annotated scene sound effect representation, obtaining the trained text encoder when the loss satisfies a convergence condition.
Before the text sample annotated with the scene sound effect representation is obtained, a database can be constructed that contains text samples annotated with audio samples, where the text samples include at least one sound effect text and/or at least one non-sound-effect text, the audio samples include at least one sound effect audio and/or at least one mute audio, each sound effect text is annotated with its corresponding sound effect audio, and each non-sound-effect text is annotated with its corresponding mute audio. The text sample annotated with a scene sound effect representation may then be obtained by inputting the audio sample into an audio encoder, which extracts the scene content of the at least one sound effect audio and/or at least one mute audio to obtain the audio representation corresponding to each sound effect audio and/or each mute audio output by the audio encoder; these audio representations are the annotated scene sound effect representations.
Specifically, the training data contained in the database may include text samples and the audio samples corresponding to them, where a text sample may include at least one sound effect text and/or at least one non-sound-effect text, and an audio sample may include at least one sound effect audio and/or at least one mute audio. For example, a sound effect text may be the text content "several people are walking" and its corresponding sound effect audio the audio data of footsteps; each text and its corresponding audio form a <text, audio> pair. The scenes include, but are not limited to, a dog barking, lightning, a child crying, wind blowing and rain falling, and the content described by the sound effect texts covers the same kinds of scenes. The database may be constructed by recording audio or by directly using publicly available data sets, among other ways. The data in the database used as training data should, first, cover a large number of scenes, for example more than 500; second, reach a certain total duration, for example more than 2000 hours; and further, the audio data should not include human voice data such as speech or singing. It should be noted that, in order for the text encoder to automatically recognise non-sound-effect text, a batch of <arbitrary text, mute audio> pairs needs to be constructed to extend the whole training set (database), as in the sketch below.
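A sketch of one possible layout of this database, where the field names and helper function are illustrative assumptions rather than part of the embodiment:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TextAudioPair:
    text: str                # e.g. "several people are walking"
    audio_path: str          # path to the annotated audio clip
    is_sound_effect: bool    # False for <arbitrary text, mute audio> pairs

def build_database(sound_effect_pairs: List[Tuple[str, str]],
                   arbitrary_texts: List[str],
                   mute_audio_path: str) -> List[TextAudioPair]:
    db = [TextAudioPair(text, audio, True) for text, audio in sound_effect_pairs]
    # Extend the training set with <arbitrary text, mute audio> pairs so the text
    # encoder learns to map non-sound-effect text to the mute-audio representation.
    db += [TextAudioPair(text, mute_audio_path, False) for text in arbitrary_texts]
    return db
```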
The predicted text representation output by the text encoder may be obtained by having the text encoder extract the scene content of the at least one sound effect text and/or at least one non-sound-effect text of the text sample, yielding the predicted text representation corresponding to each sound effect text and/or each non-sound-effect text.
The trained text encoder may be obtained by calculating, according to the regression loss function, a first loss between the predicted text representation corresponding to each sound effect text and the audio representation corresponding to each sound effect audio, and/or a second loss between the predicted text representation of each non-sound-effect text and the audio representation corresponding to each mute audio, and stopping when the first loss and/or the second loss satisfy the convergence condition.
The first loss and/or second loss may be computed as follows: for each sound effect text, calculate, according to the regression loss function, a third loss between the predicted text representation corresponding to the sound effect text and the audio representation corresponding to the sound effect audio annotated for that text, and a fourth loss between that predicted text representation and the audio representations corresponding to the other sound effect audios; and/or, for each non-sound-effect text, calculate a fifth loss between the predicted text representation corresponding to the non-sound-effect text and the audio representation corresponding to the mute audio annotated for that text, and a sixth loss between that predicted text representation and the audio representations corresponding to the sound effect audios other than that mute audio. When the third loss and/or the fifth loss satisfy a first convergence condition and the fourth loss and/or the sixth loss satisfy a second convergence condition, it is determined that the first loss and/or the second loss satisfy the convergence condition, and the trained text encoder is obtained.
Before the audio sample is input into the audio encoder, uniform-sampling preprocessing at a preset sampling rate and/or volume equalization preprocessing can be performed on the at least one sound effect audio and/or at least one mute audio of the audio sample to obtain at least one preprocessed sound effect audio and/or at least one preprocessed mute audio, and thus a preprocessed audio sample. The audio sample may then be input into the audio encoder in the form of this preprocessed audio sample.
Specifically, after the audio signals (the at least one sound effect audio and/or at least one mute audio of the audio sample) are obtained, the sampling rate can be unified to a fixed value such as 24000 Hz, and volume equalization can then be performed to avoid instability of the generation effect caused by volume differences. It should be understood that a general audio generation task usually uses a mel spectrogram as the audio feature and then uses a vocoder to restore the mel spectrogram to audio; however, the audio generated by the present invention consists of physical sound effects or background sounds, and a general vocoder cannot meet the requirements on synthesized sound quality. The present invention therefore innovatively uses a neural audio codec: the audio encoder of Facebook's EnCodec is used to extract VQ (Vector Quantization) features as the audio features (this audio encoder is referred to as the Audio Encoder), which improves the feasibility of extracting the scene content of the audio sample and obtaining the audio representation corresponding to the sound effect audio.
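A sketch of this preprocessing and VQ feature extraction, assuming the open-source torchaudio and encodec packages; the target RMS value and the exact EnCodec calls are assumptions for illustration, not the embodiment's own code.

```python
import torch
import torchaudio
from encodec import EncodecModel

TARGET_SR = 24_000   # unified sampling rate described above
TARGET_RMS = 0.1     # assumed loudness target for volume equalization

def preprocess(wav: torch.Tensor, sr: int) -> torch.Tensor:
    # Unify the sampling rate to 24000 Hz.
    if sr != TARGET_SR:
        wav = torchaudio.functional.resample(wav, sr, TARGET_SR)
    # Simple RMS-based volume equalization to avoid unstable generation
    # caused by volume differences between clips.
    rms = wav.pow(2).mean().sqrt().clamp_min(1e-8)
    return wav * (TARGET_RMS / rms)

def extract_vq_features(wav: torch.Tensor, sr: int) -> torch.Tensor:
    codec = EncodecModel.encodec_model_24khz()          # EnCodec audio encoder
    with torch.no_grad():
        frames = codec.encode(preprocess(wav, sr).view(1, 1, -1))
    # Each frame carries discrete VQ codes, used here as the audio representation.
    return torch.cat([codes for codes, _ in frames], dim=-1)
```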
Specifically, the method for training the text encoder is a cross-modal training method: the parameters of the audio encoder are frozen and only the parameters of the text encoder are fine-tuned. Its purpose is to supervise the Text Encoder so that the text representation it outputs approximates the VQ audio feature output by the audio encoder. Referring to fig. 3, which is a schematic flow chart of the method for training a text encoder according to an embodiment of the present application, the Audio Encoder includes but is not limited to the audio encoder of Facebook's EnCodec, and the Text Encoder includes but is not limited to the pre-trained text model BERT. When training the Text Encoder, a batch of texts (at least one sound effect text and/or at least one non-sound-effect text) may be input to the Text Encoder; the resulting text representations (text features) are T_1, T_2, ..., T_n, where a text representation expresses the scene sound effect characteristics determined from the content of the text. A batch of audio signals (at least one sound effect audio and/or at least one mute audio) is input to the Audio Encoder; the resulting audio representations (audio features) are A_1, A_2, ..., A_n, where an audio representation expresses the scene sound effect characteristics determined from the content of the audio, including but not limited to a dog barking, lightning, a child crying, wind blowing and rain falling. A similarity matrix X is then obtained using cosine similarity as the metric function. The objective of contrastive learning is that the similarity should be high when the audio representation and the text representation come from the same data pair and low when they come from different data pairs; thus the learning target of the diagonal elements X_11, X_22, ..., X_nn of the similarity matrix is 1, and that of the off-diagonal elements is 0. The regression loss function may be a cross-entropy loss function; when the loss value of the text encoder has substantially converged, training can be stopped and the text encoder parameters saved. It should be noted that, to enhance the generalization capability of the model, arbitrary irrelevant texts may be added to the text samples of the database. It should be appreciated that freezing the parameters of the audio encoder and fine-tuning only the parameters of the text encoder can improve both the training efficiency and the final training effect of the text encoder.
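A sketch of one such cross-modal (CLIP-style contrastive) training step, assuming the encoders return batched embedding tensors; all names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_step(text_encoder, audio_encoder, texts, audios, optimizer):
    with torch.no_grad():
        A = audio_encoder(audios)        # audio representations A_1..A_n (frozen encoder)
    T = text_encoder(texts)              # text representations T_1..T_n (fine-tuned)

    A = F.normalize(A, dim=-1)
    T = F.normalize(T, dim=-1)
    X = T @ A.t()                        # cosine-similarity matrix X, shape (n, n)

    # Matched <text, audio> pairs sit on the diagonal: target high similarity there,
    # low similarity off-diagonal, via a symmetric cross-entropy loss.
    targets = torch.arange(X.size(0), device=X.device)
    loss = (F.cross_entropy(X, targets) + F.cross_entropy(X.t(), targets)) / 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```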
Before the text representation is sequentially denoised by the N sub-diffusion models of the pre-trained diffusion model to obtain the N-times-denoised text representation, the diffusion model can be trained. An audio representation sample annotated with a text representation and with the real noise of N time steps is obtained; the audio representation sample is input into the diffusion model, whose noise-adding modules sequentially add noise to the audio representation sample to obtain a noise-added audio representation, whose noise estimation models sequentially estimate the noise of the noise-added audio representation according to the relation between the noise-added audio representation sample and the annotated text representation to obtain predicted noise values, and whose denoising modules sequentially denoise the noise-added audio representation according to the predicted noise values to obtain the N-times-denoised predicted audio representation. Finally, the losses between the predicted noise values corresponding to the N sub-diffusion models and the annotated real noise of the N time steps are calculated according to a regression loss function, and when the losses corresponding to the N sub-diffusion models satisfy the convergence conditions, the trained noise estimation models of the N sub-diffusion models are obtained and, from them, the trained diffusion model.
The sequential noise-adding, noise estimation and denoising may proceed as follows. The audio representation sample is input into the 1st time-step sub-diffusion model: its noise-adding module adds noise to the audio representation sample to obtain a noise-added audio representation, its noise estimation model estimates the noise of the noise-added audio representation according to the relation between the noise-added audio representation and the annotated text representation to obtain a predicted noise value, and its denoising module denoises the noise-added audio representation according to the predicted noise value to output the 1st time-step predicted audio representation. For the t-th time-step sub-diffusion model, the (t-1)-th time-step predicted audio representation output by the (t-1)-th time-step sub-diffusion model is input into the t-th time-step sub-diffusion model: its noise-adding module adds noise to the (t-1)-th time-step predicted audio representation to obtain a noise-added audio representation, its noise estimation model estimates the noise of the noise-added audio representation according to the relation between the noise-added audio representation and the annotated text representation to obtain a predicted noise value, and the noise-added audio representation is denoised according to the predicted noise value to output the t-th time-step predicted audio representation; the predicted audio representation of the final (N-th) time step is the N-times-denoised predicted audio representation, where 2 ≤ t ≤ N.
The trained diffusion model may be obtained from the trained noise estimation models of the N sub-diffusion models as follows: for the noise estimation model of the 1st time-step sub-diffusion model, a 1st loss between its predicted noise value and the annotated real noise of the 1st time step is calculated according to the regression loss function, and when the 1st loss satisfies the convergence condition the trained noise estimation model of the 1st time-step sub-diffusion model is obtained; for the noise estimation model of the t-th time-step sub-diffusion model, a t-th loss between its predicted noise value and the annotated real noise of the t-th time step is calculated according to the regression loss function, and when the t-th loss satisfies the convergence condition the trained noise estimation model of the t-th time-step sub-diffusion model is obtained; the trained diffusion model is then obtained from the trained noise estimation model of the 1st time-step sub-diffusion model and/or the trained noise estimation model of the t-th time-step sub-diffusion model.
It should be understood that the diffusion model may include N sub-diffusion models, where N is an integer greater than or equal to 2, so training the diffusion model means training the N sub-diffusion models. Each sub-diffusion model includes a diffusion process (noise-adding process) and a reverse process (denoising process), so each sub-diffusion model needs to be trained; the sub-diffusion models may be trained one at a time or all at the same time, and the specific training mode can be determined according to actual needs.
It should also be understood that training each sub-diffusion model involves two processes: a diffusion process (noise-adding process), which is a Markov chain with fixed parameters that converts a complex data distribution into an isotropic Gaussian distribution by adding noise step by step, and a reverse process (denoising process), which is a Markov chain implemented by a neural network that learns to restore data carrying Gaussian white noise step by step to the original data. In the diffusion process, assuming that the number of diffusion steps is T and the audio representation at step t is $y_t$, the diffusion process can be expressed as formula one and formula two:

$$y_t = \sqrt{1-\beta_t}\, y_{t-1} + \sqrt{\beta_t}\, \epsilon_t \qquad \text{(formula one)}$$

$$y_t = \sqrt{\bar{\alpha}_t}\, y_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon_t, \qquad \bar{\alpha}_t = \prod_{i=1}^{t}\left(1-\beta_i\right) \qquad \text{(formula two)}$$

where $\epsilon_t$ is the Gaussian noise of time step t, $\beta_i$ is the noise strength of time step i, and $\bar{\alpha}_t$ is the cumulative noise intensity up to time step t; after T noise-adding steps, $y_0, y_1, \ldots, y_T$ are obtained. In the reverse process, converting $y_t$ step by step into $y_{t-1}$ by denoising requires noise estimation with the Unet model. Letting the audio feature signal at step t-1 be $y_{t-1}$, $y_{t-1}$ can be expressed as formula three:

$$y_{t-1} = \frac{1}{\sqrt{1-\beta_t}}\left(y_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_{\theta}(t)\right) \qquad \text{(formula three)}$$

where $\epsilon_{\theta}(t)$ is the noise predicted by the Unet at step t.
As can be seen from formulas one, two and three, training a sub-diffusion model mainly means training its reverse process (denoising process), and training the reverse process mainly means training the noise estimation model of the sub-diffusion model to estimate the noise. Specifically, referring to fig. 4, which is a schematic flow chart of the method for training a diffusion model according to an embodiment of the present application, the noise estimation model is a Unet model. The Unet model may be trained by obtaining the time step t, so as to determine the Unet model of sub-diffusion model t to be trained, obtaining the audio representation (audio representation vector) $y_t$ and the text representation (text representation vector), and inputting $y_t$ and the text representation into the Unet model of sub-diffusion model t. The loss between the predicted noise value output by the Unet model and the real noise is then calculated, with the loss function set to the minimum absolute error; when the loss value of the Unet model of sub-diffusion model t satisfies the convergence condition, training is stopped and the Unet model parameters of sub-diffusion model t are saved.
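A sketch of one such Unet training step, following formulas one to three; the noise schedule, batching convention and model interface are illustrative assumptions rather than the embodiment's exact implementation.

```python
import torch
import torch.nn.functional as F

def train_unet_step(unet, y0, text_repr, t, betas, optimizer):
    # Precompute \bar{alpha}_t from the noise strengths beta_i (formula two).
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)
    noise = torch.randn_like(y0)                         # real noise epsilon_t

    # Diffusion (noise-adding) process: jump directly from y_0 to y_t.
    a_bar = alphas_bar[t].view(-1, *([1] * (y0.dim() - 1)))
    y_t = a_bar.sqrt() * y0 + (1.0 - a_bar).sqrt() * noise

    # The Unet predicts the added noise from the noisy representation, the time
    # step and the annotated text representation (the condition).
    pred_noise = unet(y_t, t, text_repr)

    loss = F.l1_loss(pred_noise, noise)                  # minimum absolute error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```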
It can be understood that the trained diffusion model takes random noise as input and the text representation as the condition, obtains, through T reverse-process steps, an audio feature (audio representation) that conforms to the text representation vector, and finally obtains the target sound effect audio through the audio decoder. This improves the ability to denoise the text representation output by the text encoder and improves the accuracy of sound effect audio generation for the audio book.
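A sketch of this reverse process at inference time, following formula three: start from random noise, condition every step on the text representation, and pass the result to the audio decoder. Variable names mirror the formulas; the interfaces and the optional stochastic term are assumptions.

```python
import torch

@torch.no_grad()
def sample_audio_repr(unets, text_repr, betas, shape):
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)
    y = torch.randn(shape)                                # y_T: pure Gaussian noise
    for t in reversed(range(len(betas))):                 # T reverse steps
        eps = unets[t](y, t, text_repr)                   # predicted noise epsilon_theta(t)
        y = (y - betas[t] / (1 - alphas_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:                                         # optional stochastic term
            y = y + betas[t].sqrt() * torch.randn_like(y)
    return y                                              # fed to the audio decoder
```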
After the text representation has been input into the pre-trained diffusion model and sequentially denoised by its N sub-diffusion models to obtain the N-times-denoised text representation (N being an integer greater than or equal to 2), the denoised text representation can be input into the audio decoder to obtain the target sound effect audio, output by the audio decoder, that corresponds to the scene of the target text.
Specifically, the target sound effect audio can be mixed into the audio book, finally producing an audio book with sound effects. For example, when the target text of the audio book is "at this time, the user opens the door and walks around", the sound of a door opening and of footsteps can be generated automatically; this door-opening and footstep audio is the target sound effect audio corresponding to the target text. It can be understood that, once the target text (the text to be read) of the audio book is input, sound effects that fit the scene of the text can be generated automatically, which improves the immersion and vividness of the audio book and stimulates readers' interest in reading.
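A minimal sketch of mixing the generated sound effect into the narrated track at the position of the target text, assuming mono 1-D waveforms at a common sampling rate; the offset and gain are illustrative assumptions.

```python
import torch

def mix_sound_effect(book_wav: torch.Tensor, effect_wav: torch.Tensor,
                     start_sample: int, gain: float = 0.5) -> torch.Tensor:
    # Overlay the sound effect onto a copy of the audio-book waveform.
    out = book_wav.clone()
    end = min(start_sample + effect_wav.numel(), out.numel())
    out[start_sample:end] += gain * effect_wav[: end - start_sample]
    return out.clamp(-1.0, 1.0)   # avoid clipping after mixing
```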
In this embodiment, the target text of the audio book may be input into a pre-trained text encoder, the scene content of the target text extracted to obtain a text representation, the text representation input into a pre-trained diffusion model and denoised N times, and the N-times-denoised text representation input into an audio decoder to obtain the target sound effect audio corresponding to the scene of the target text. First, this is a generative approach that replaces the matching approach: the text representation is extracted and denoised, so the resulting representation is more accurate, the target sound effect audio is generated from this more accurate representation, and the mismatches caused by the low accuracy of the matching approach's similarity estimation are avoided. Second, once the target text (the text to be read) of the audio book is input, sound effects that fit the scene of the text can be generated automatically, improving the immersion and vividness of the audio book and stimulating readers' interest in reading. Third, the text-driven approach is also compatible with visual input: video images can be processed by image captioning to obtain text input, so sound effect audio generation for the audio book is highly flexible. Furthermore, the sound effect generation system based on contrastive learning and a diffusion model can be trained on a large number of text samples and audio samples, so sound effects that fit the text scene can be generated automatically from the text of the audio book, improving the accuracy of the model, the richness of the sound effect scenes, the efficiency of the model, and the immersion and vividness of the result. Finally, because this embodiment is generative, the sound effect is obtained simply by running the pre-trained models, so each generation result has a certain uniqueness, the problem of repeatedly reusing the same material is avoided, and the high technical cost, high error rate, low transmission speed and poor security of the existing matching approach are overcome.
It will be appreciated that other reasonable methods may also be used instead of the methods described above, and the present application is not limited in this regard. This applies to the method of training the text encoder; the method of training the diffusion model; the method of calculating, according to a regression loss function, a first loss between the predicted text representation corresponding to each sound effect text and the audio representation corresponding to each sound effect audio and/or a second loss between the predicted text representation of each non-sound effect text and the audio representation corresponding to the mute audio, and obtaining the trained text encoder when the first loss and/or the second loss meet a convergence condition; the method of inputting the audio samples into the audio encoder; the method of inputting the audio representation sample into the diffusion model, sequentially performing noise adding processing on the audio representation sample by the noise adding modules corresponding to the N sub-diffusion models of the diffusion model to obtain a noise-added audio representation, sequentially estimating the noise of the noise-added audio representation by the noise estimation models corresponding to the N sub-diffusion models according to the relation between the noise-added audio representation and the annotated text representation to obtain predicted noise values, and sequentially performing denoising processing on the noise-added audio representation by the denoising modules corresponding to the N sub-diffusion models according to the predicted noise values to obtain a denoised predicted audio representation; and the method of sequentially calculating the losses between the predicted noise values corresponding to the N sub-diffusion models and the annotated real noise of the N time steps according to the regression loss function, and obtaining the trained noise estimation models corresponding to the N sub-diffusion models when the losses meet the convergence condition, so as to obtain the trained diffusion model according to the trained noise estimation models.
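As a hedged sketch of the training signals summarized above (dimensions, architectures, and noise schedule are all invented; this is not the training code of the application), the snippet below shows a regression loss that pulls a predicted text representation toward its annotated audio representation, and a per-time-step noise estimation loss for the sub-diffusion models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, n_steps = 256, 4

# Text encoder training signal: regression loss between the predicted text representation
# and the audio representation of the annotated sound effect (or mute) audio.
def text_to_audio_regression_loss(pred_text_rep, annotated_audio_rep):
    return F.mse_loss(pred_text_rep, annotated_audio_rep)

_ = text_to_audio_regression_loss(torch.randn(1, dim), torch.randn(1, dim))

# Diffusion training signal: one hypothetical noise estimation model per sub-diffusion model,
# conditioned on the annotated text representation.
class NoiseEstimator(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim * 2, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, noisy_audio_rep, text_rep):
        return self.net(torch.cat([noisy_audio_rep, text_rep], dim=-1))

estimators = [NoiseEstimator(dim) for _ in range(n_steps)]
audio_rep = torch.randn(1, dim)   # audio representation sample from the audio encoder
text_rep = torch.randn(1, dim)    # annotated text representation for that sample

rep = audio_rep
for t, estimator in enumerate(estimators):
    true_noise = torch.randn_like(rep)            # annotated real noise for time step t
    noisy = rep + true_noise                      # noise adding module for time step t
    pred_noise = estimator(noisy, text_rep)       # noise estimation conditioned on the text
    loss_t = F.mse_loss(pred_noise, true_noise)   # regression loss for this sub-diffusion model
    loss_t.backward()                             # an optimizer step would follow in real training
    rep = (noisy - pred_noise).detach()           # denoising module output feeds the next time step
```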
The method for generating sound effect audio for an audio book in the embodiments of the present application is described above; the sound effect audio generating device in the embodiments of the present application is described below. Referring to fig. 5, an embodiment of the sound effect audio generating device in the embodiments of the present application includes:
an obtaining unit 501, configured to obtain a target text of an audio book;
an extracting unit 502, configured to input the target text into a pre-trained text encoder, and extract, by the pre-trained text encoder, scene content of the target text to obtain a text representation output by the text encoder;
a denoising unit 503, configured to input the text representation into a pre-trained diffusion model, and sequentially perform denoising processing on the text representation by the N sub-diffusion models of the pre-trained diffusion model to obtain the text representation after N times of denoising processing output by the diffusion model, wherein N is an integer greater than or equal to 2;
and a decoding unit 504, configured to input the denoised text representation into an audio decoder to obtain the target sound effect audio corresponding to the scene of the target text output by the audio decoder.
According to this embodiment of the application, the target text of the audio book can be input into the pre-trained text encoder, the scene content of the target text can be extracted to obtain a text representation, the text representation can be input into the pre-trained diffusion model and subjected to N times of denoising processing, and the text representation after N times of denoising can be input into the audio decoder to obtain the target sound effect audio corresponding to the scene of the target text output by the audio decoder. This is a generation-based approach that replaces the matching-based approach: the text representation is extracted and then denoised, so the resulting text representation is more accurate, and generating the target sound effect audio from this more accurate representation avoids the mismatches caused by the low accuracy of the similarity estimation used in matching-based approaches.
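Purely as an illustration of how the four units of fig. 5 compose, the sketch below wires invented callables in place of the trained text encoder, diffusion model, and audio decoder; none of the names come from this application.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class SoundEffectGenerator:
    """Mirrors the units of fig. 5: obtain the text, encode it, denoise the representation, decode audio."""
    text_encoder: Callable[[str], Any]         # extracting unit 502 (pre-trained text encoder)
    diffusion_denoiser: Callable[[Any], Any]   # denoising unit 503 (N sub-diffusion models)
    audio_decoder: Callable[[Any], Any]        # decoding unit 504

    def generate(self, target_text: str) -> Any:
        # obtaining unit 501: the target text of the audio book arrives here
        text_rep = self.text_encoder(target_text)
        denoised_rep = self.diffusion_denoiser(text_rep)
        return self.audio_decoder(denoised_rep)

# Toy stand-ins so the sketch runs end to end.
device = SoundEffectGenerator(
    text_encoder=lambda text: [float(len(text))],
    diffusion_denoiser=lambda rep: [x * 0.5 for x in rep],
    audio_decoder=lambda rep: b"\x00" * int(rep[0]),
)
sfx_audio = device.generate("at this time, the user opens the door and walks in")
```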
Referring now to fig. 6, another embodiment of a sound effect audio generating device 600 according to an embodiment of the present application includes:
a central processor 601, a memory 605, an input/output interface 604, a wired or wireless network interface 603, and a power supply 602;
memory 605 is a transient memory or a persistent memory;
the central processor 601 is configured to communicate with the memory 605 and to execute the instructions in the memory 605 to perform the method of the embodiment shown in fig. 2 described above.
Embodiments of the present application also provide a computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the method of the embodiment shown in fig. 2 described above.
Embodiments of the present application also provide a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of the embodiment shown in fig. 2 described above.
It should be understood that, although the steps in the flowcharts related to the embodiments described above are shown sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the steps are not strictly limited to that order and may be performed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and these sub-steps or stages are not necessarily performed sequentially but may be performed in turn or alternately with at least a part of the other steps, sub-steps, or stages.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
If the integrated units are implemented in the form of software functional units and sold or used as stand-alone products, they may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

Claims (10)

1. A method for generating sound effect audio for an audio book, comprising:
obtaining a target text of an audio book;
inputting the target text into a pre-trained text encoder, and extracting scene content of the target text by the pre-trained text encoder to obtain a text representation output by the text encoder;
inputting the text representation into a pre-trained diffusion model, and sequentially denoising the text representation by N sub-diffusion models of the pre-trained diffusion model to obtain the text representation after N times of denoising processing output by the diffusion model; wherein N is an integer greater than or equal to 2;
and inputting the text representation after the denoising processing into an audio decoder to obtain target sound effect audio corresponding to the scene of the target text output by the audio decoder.
2. The method of claim 1, wherein prior to the inputting the target text into a pre-trained text encoder, the method comprises:
obtaining a text sample, wherein the text sample is annotated with a scene sound effect representation;
inputting the text sample into a text encoder to obtain a predicted text representation output by the text encoder;
and calculating the loss between the predicted text representation corresponding to each text sample and the annotated scene sound effect representation according to a regression loss function, and obtaining the trained text encoder when the loss meets a convergence condition.
3. The method of claim 2, wherein prior to the obtaining a text sample, the text sample being annotated with a scene sound effect representation, the method further comprises:
constructing a database, wherein the database comprises text samples, and the text samples are annotated with audio samples; wherein the text samples comprise at least one sound effect text and/or at least one non-sound effect text; the audio samples comprise at least one sound effect audio and/or at least one mute audio; each sound effect text is annotated with its corresponding sound effect audio, and each non-sound effect text is annotated with its corresponding mute audio;
the obtaining a text sample, the text sample being annotated with a scene sound effect representation, comprises:
inputting the audio samples into an audio encoder, and extracting scene content of the at least one sound effect audio and/or the at least one mute audio of the audio samples by the audio encoder to obtain an audio representation corresponding to each sound effect audio and/or an audio representation corresponding to each mute audio output by the audio encoder; wherein the audio representation corresponding to each sound effect audio and/or the audio representation corresponding to each mute audio output by the audio encoder is the annotated scene sound effect representation;
the inputting the text sample into a text encoder to obtain a predicted text representation output by the text encoder comprises:
inputting the text sample into a text encoder, and extracting scene content of at least one sound effect text and/or at least one non-sound effect text of the text sample by the text encoder to obtain a predicted text representation corresponding to each sound effect text and/or a predicted text representation corresponding to each non-sound effect text output by the text encoder;
the calculating the loss between the predicted text representation corresponding to each text sample and the annotated scene sound effect representation according to the regression loss function, and obtaining the trained text encoder when the loss meets the convergence condition, comprises:
calculating a first loss between the predicted text representation corresponding to each sound effect text and the audio representation corresponding to each sound effect audio and/or a second loss between the predicted text representation of each non-sound effect text and the audio representation corresponding to each mute audio according to a regression loss function, and obtaining the trained text encoder when the first loss and/or the second loss meet a convergence condition.
4. The method according to claim 3, wherein the calculating, according to a regression loss function, a first loss between the predicted text representation corresponding to each sound effect text and the audio representation corresponding to each sound effect audio and/or a second loss between the predicted text representation of each non-sound effect text and the audio representation corresponding to each mute audio, and obtaining the trained text encoder when the first loss and/or the second loss meet a convergence condition, comprises:
for each sound effect text, calculating a third loss between the predicted text representation corresponding to the sound effect text and the audio representation corresponding to the sound effect audio annotated for the sound effect text according to a regression loss function, and calculating a fourth loss between the predicted text representation corresponding to the sound effect text and the audio representations corresponding to the sound effect audio other than the sound effect audio annotated for the sound effect text; and/or
for each non-sound effect text, calculating a fifth loss between the predicted text representation corresponding to the non-sound effect text and the audio representation corresponding to the mute audio annotated for the non-sound effect text according to a regression loss function, and calculating a sixth loss between the predicted text representation corresponding to the non-sound effect text and the audio representations corresponding to the sound effect audio other than the mute audio annotated for the non-sound effect text;
when the third loss and/or the fifth loss satisfy a first convergence condition and the fourth loss and/or the sixth loss satisfy a second convergence condition, determining that the first loss and/or the second loss satisfy a convergence condition to obtain a trained text encoder.
5. The method of claim 3, wherein prior to the inputting the audio samples into an audio encoder, the method further comprises:
performing uniform resampling preprocessing at a preset sampling rate and/or volume equalization preprocessing on the at least one sound effect audio and/or the at least one mute audio of the audio samples to obtain at least one preprocessed sound effect audio and/or at least one preprocessed mute audio, so as to obtain preprocessed audio samples;
the inputting the audio samples into an audio encoder comprises:
and inputting the preprocessed audio samples into an audio encoder.
6. The method of claim 1, wherein each of the N sub-diffusion models comprises a corresponding noise adding module, noise estimation model, and denoising module;
the method further comprises:
obtaining an audio representation sample, wherein the audio representation sample is annotated with a text representation and real noise for N time steps;
inputting the audio representation sample into a diffusion model, sequentially performing noise adding processing on the audio representation sample by the noise adding modules corresponding to the N sub-diffusion models of the diffusion model to obtain a noise-added audio representation, sequentially estimating the noise of the noise-added audio representation by the noise estimation models corresponding to the N sub-diffusion models according to the relation between the noise-added audio representation and the annotated text representation to obtain predicted noise values, and sequentially performing denoising processing on the noise-added audio representation by the denoising modules corresponding to the N sub-diffusion models according to the predicted noise values to obtain a predicted audio representation after N times of denoising processing;
and sequentially calculating the losses between the predicted noise values corresponding to the N sub-diffusion models and the annotated real noise of the N time steps according to a regression loss function, and obtaining the trained noise estimation models corresponding to the N sub-diffusion models when the losses corresponding to the N sub-diffusion models meet a convergence condition, so as to obtain the trained diffusion model according to the trained noise estimation models corresponding to the N sub-diffusion models.
7. The method according to claim 6, wherein the inputting the audio representation sample into a diffusion model, sequentially performing noise adding processing on the audio representation sample by the noise adding modules corresponding to the N sub-diffusion models of the diffusion model to obtain a noise-added audio representation, sequentially estimating the noise of the noise-added audio representation by the noise estimation models corresponding to the N sub-diffusion models according to the relation between the noise-added audio representation and the annotated text representation to obtain predicted noise values, and sequentially performing denoising processing on the noise-added audio representation by the denoising modules corresponding to the N sub-diffusion models according to the predicted noise values to obtain the predicted audio representation after N times of denoising processing, comprises:
inputting the audio representation sample into the 1st time step diffusion model, performing noise adding processing on the audio representation sample by the noise adding module of the 1st time step diffusion model to obtain a noise-added audio representation, estimating the noise of the noise-added audio representation by the noise estimation model of the 1st time step diffusion model according to the relation between the noise-added audio representation and the annotated text representation to obtain a predicted noise value, and performing denoising processing on the noise-added audio representation by the denoising module of the 1st time step diffusion model according to the predicted noise value to output a 1st time step predicted audio representation;
for the t-th time step diffusion model, inputting the (t-1)-th time step predicted audio representation output by the (t-1)-th time step diffusion model into the t-th time step diffusion model, performing noise adding processing on the (t-1)-th time step predicted audio representation by the noise adding module of the t-th time step diffusion model to obtain a noise-added audio representation, estimating the noise of the noise-added audio representation by the noise estimation model of the t-th time step diffusion model according to the relation between the noise-added audio representation and the annotated text representation to obtain a predicted noise value, and performing denoising processing on the noise-added audio representation according to the predicted noise value to output a t-th time step predicted audio representation; wherein the t-th time step predicted audio representation obtained when t equals N is the predicted audio representation after the N times of denoising processing, and t is greater than or equal to 2 and less than or equal to N.
8. The method according to claim 7, wherein the sequentially calculating the losses between the predicted noise values corresponding to the N sub-diffusion models and the annotated real noise of the N time steps according to the regression loss function, and obtaining the trained noise estimation models corresponding to the N sub-diffusion models when the losses corresponding to the N sub-diffusion models meet the convergence condition, so as to obtain the trained diffusion model according to the trained noise estimation models corresponding to the N sub-diffusion models, comprises:
for the noise estimation model of the 1st time step diffusion model, calculating a 1st loss between the predicted noise value corresponding to the noise estimation model of the 1st time step diffusion model and the real noise corresponding to the annotated 1st time step according to a regression loss function; and when the 1st loss meets the convergence condition, obtaining the noise estimation model of the trained 1st time step diffusion model;
for the noise estimation model of the t-th time step diffusion model, calculating a t-th loss between the predicted noise value corresponding to the noise estimation model of the t-th time step diffusion model and the real noise corresponding to the annotated t-th time step according to a regression loss function; and when the t-th loss meets the convergence condition, obtaining the noise estimation model of the trained t-th time step diffusion model;
and obtaining the trained diffusion model according to the noise estimation model of the trained 1st time step diffusion model and/or the noise estimation model of the trained t-th time step diffusion model.
9. A sound effect audio generating device, comprising:
a central processing unit and a memory;
the memory is a transient memory or a persistent memory;
the central processor is configured to communicate with the memory and to execute the instructions in the memory to perform the method of any one of claims 1 to 8.
10. A computer readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1 to 8.
CN202311157409.XA 2023-09-08 2023-09-08 Sound effect audio generation method and device for audio book and readable storage medium Pending CN117059106A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311157409.XA CN117059106A (en) 2023-09-08 2023-09-08 Sound effect audio generation method and device for audio book and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311157409.XA CN117059106A (en) 2023-09-08 2023-09-08 Sound effect audio generation method and device for audio book and readable storage medium

Publications (1)

Publication Number Publication Date
CN117059106A true CN117059106A (en) 2023-11-14

Family

ID=88667660

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311157409.XA Pending CN117059106A (en) 2023-09-08 2023-09-08 Sound effect audio generation method and device for audio book and readable storage medium

Country Status (1)

Country Link
CN (1) CN117059106A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117496927A (en) * 2024-01-02 2024-02-02 广州市车厘子电子科技有限公司 Music timbre style conversion method and system based on diffusion model



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination