CN113192522B - Audio synthesis model generation method and device and audio synthesis method and device - Google Patents

Audio synthesis model generation method and device and audio synthesis method and device

Info

Publication number
CN113192522B
CN113192522B, CN202110438286A
Authority
CN
China
Prior art keywords
audio
information
audio data
target
sample
Prior art date
Legal status
Active
Application number
CN202110438286.1A
Other languages
Chinese (zh)
Other versions
CN113192522A (en)
Inventor
张冉
王晓瑞
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110438286.1A
Publication of CN113192522A
Application granted
Publication of CN113192522B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/02 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/027 Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G10L 19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L 19/16 Vocoder architecture

Abstract

The disclosure provides an audio synthesis model generation method and device and an audio synthesis method and device, relates to the technical field of audio processing, and aims to solve the problem in the related art that synthesized singing audio sounds unrealistic. The method comprises the following steps: acquiring features of first audio data, features of second audio data, and type information and spectrum information of sample audio; merging the features of the first audio data and the features of the second audio data to obtain a target feature; performing type recognition and spectrum recognition on the target audio based on the target feature to obtain type information and spectrum information of the target audio; comparing these with the type information and spectrum information of the sample audio to determine first information and second information; and generating an audio synthesis model according to the first information and the second information. In this way, the target audio generated by the resulting audio synthesis model closely matches the sample audio, and the realism of the synthesized audio generated by the audio synthesis model is improved.

Description

Audio synthesis model generation method and device and audio synthesis method and device
Technical Field
The present disclosure relates to the field of audio data processing technologies, and in particular, to a method and an apparatus for generating an audio synthesis model, and a method and an apparatus for audio synthesis.
Background
In recent years, audio synthesis technology, and singing voice synthesis as one of its applications, has become increasingly popular; for example, audio synthesis is widely used in virtual singing applications and intelligent navigation applications.
In the prior art, singing voice synthesis mainly requires recording a large number of songs in advance, labeling the recorded songs, and feeding them into a synthesis model for training to obtain a set of acoustic parameters; the singing audio is then synthesized based on these acoustic parameters.
However, this singing voice synthesis process not only requires a large number of songs to be recorded in advance, but also requires a professional singer to record them in a specific environment. When no large labeled dataset recorded by a professional singer is available, the synthesized singing voice therefore sounds unrealistic.
Disclosure of Invention
The present disclosure provides an audio synthesis model generation method and apparatus, and an audio synthesis method and apparatus, so as to at least solve the technical problem in the related art that synthesized singing voice sounds unrealistic. The technical solution of the disclosure is as follows:
According to a first aspect of the embodiments of the present disclosure, there is provided an audio synthesis model generation method, including: an audio synthesis model generation device acquires features of first audio data, features of second audio data, and type information and spectrum information of sample audio, where the sample audio is obtained by synthesizing the first audio data and the second audio data, the first audio data includes speech audio and speech text, and the second audio data includes singing audio and lyric text; the device merges the features of the first audio data and the features of the second audio data to obtain a target feature, where the target feature is used to represent the features of a target audio synthesized from the first audio data and the second audio data; the device performs type recognition and spectrum recognition on the target audio based on the target feature to obtain type information and spectrum information of the target audio, respectively; the device determines first information according to the type information of the sample audio and the type information of the target audio, and determines second information according to the spectrum information of the sample audio and the spectrum information of the target audio, where the first information is used to represent the reverse difference between the type information of the sample audio and the type information of the target audio, and the second information is used to represent the difference between the spectrum information of the sample audio and the spectrum information of the target audio; and the device generates an audio synthesis model according to the first information and the second information.
Optionally, the obtaining, by the audio synthesis model generating device, characteristics of the first audio data includes: performing phoneme recognition on the first audio data to obtain phoneme characteristics of the first audio data; carrying out fundamental frequency identification on the first audio data to obtain fundamental frequency characteristics of the first audio data; and splicing the phoneme characteristics of the first audio data and the fundamental frequency characteristics of the first audio data to obtain the characteristics of the first audio data.
Optionally, in this embodiment of the present disclosure, the obtaining, by the audio synthesis model generating device, characteristics of the second audio data includes: performing phoneme recognition on the second audio data to obtain phoneme characteristics of the second audio data; carrying out fundamental frequency identification on the second audio data to obtain fundamental frequency characteristics of the second audio data; and splicing the phoneme characteristics of the second audio data and the fundamental frequency characteristics of the second audio data to obtain the characteristics of the second audio data.
Optionally, in this embodiment of the present disclosure, the audio synthesis model generation apparatus updates a parameter of the feature extraction network in the audio synthesis model when the first information is minimum and the second information is minimum.
Optionally, in this embodiment of the present disclosure, the determining, by the audio synthesis model generating apparatus, the first information according to the type information of the sample audio and the type information of the target audio includes: calculating a difference value between the type information of the sample audio and the type information of the target audio according to the type information of the sample audio and the type information of the target audio; and obtaining first information according to the difference value and a back propagation algorithm.
Optionally, in this embodiment of the present disclosure, determining the second information according to the spectrum information of the sample audio and the spectrum information of the target audio includes: and calculating a difference value between the frequency spectrum information of the sample audio and the frequency spectrum information of the target audio according to the frequency spectrum information of the sample audio and the frequency spectrum information of the target audio, wherein the difference value is second information.
According to a second aspect of the embodiments of the present disclosure, there is provided an audio synthesizing method, including: the audio synthesis device acquires the characteristics of the target first audio data and the target second audio data; inputting the characteristics of the target first audio data and the characteristics of the target second audio data into an audio synthesis model to obtain a synthesized audio, wherein the audio synthesis model is obtained by using the audio synthesis model generation method in the first aspect.
According to a third aspect of the embodiments of the present disclosure, an audio synthesis model generation apparatus is provided, which includes an obtaining module, a feature extraction module, a first processing module, a second processing module, and a generation module. An acquisition module configured to perform acquiring a feature of the first audio data, a feature of the second audio data, type information of the sample audio, and spectrum information; the sample audio is obtained by synthesizing first audio data and second audio data; the first audio data includes voice audio and voice text, and the second audio data includes singing audio and lyric text; the feature extraction module is configured to perform feature merging based on features of the first audio data and features of the second audio data to obtain target features, and the target features are used for representing features of a target audio synthesized by the first audio data and the second audio data; the first processing module is configured to perform type identification and spectrum identification on the target audio based on the target characteristics, and obtain type information and spectrum information of the target audio respectively; the second processing module is configured to determine the first information according to the type information of the sample audio and the type information of the target audio, and determine the second information according to the spectrum information of the sample audio and the spectrum information of the target audio; the first information is used for representing the reverse difference between the type information of the sample audio and the type information of the target audio, and the second information is used for representing the difference between the spectrum information of the sample audio and the spectrum information of the target audio; a generating module configured to perform generating an audio synthesis model from the first information and the second information.
Optionally, in this embodiment of the present disclosure, the obtaining module is configured to perform a feature of obtaining the first audio data, and specifically includes: performing phoneme recognition on the first audio data to obtain phoneme characteristics of the first audio data; carrying out fundamental frequency identification on the first audio data to obtain fundamental frequency characteristics of the first audio data; and splicing the phoneme characteristics of the first audio data and the fundamental frequency characteristics of the first audio data to obtain the characteristics of the first audio data.
Optionally, in this embodiment of the present disclosure, the obtaining module is configured to perform the feature of obtaining the second audio data, and specifically includes: performing phoneme recognition on the second audio data to obtain phoneme characteristics of the second audio data; identifying the fundamental frequency of the second audio data to obtain the fundamental frequency characteristic of the second audio data; and splicing the phoneme characteristics of the second audio data and the fundamental frequency characteristics of the second audio data to obtain the characteristics of the second audio data.
Optionally, in this embodiment of the present disclosure, the audio synthesis model generation apparatus further includes an update module, and the update module is configured to update the parameter of the feature extraction network in the audio synthesis model when the first information is minimum and the second information is minimum.
Optionally, in this embodiment of the present disclosure, the second processing module is configured to determine the first information according to the type information of the sample audio and the type information of the target audio, and specifically includes: calculating a difference value between the type information of the sample audio and the type information of the target audio according to the type information of the sample audio and the type information of the target audio; and obtaining first information according to the difference value and a back propagation algorithm.
Optionally, in this embodiment of the present disclosure, the second processing module is configured to determine the second information according to the spectrum information of the sample audio and the spectrum information of the target audio, and specifically includes: and calculating a difference value between the frequency spectrum information of the sample audio and the frequency spectrum information of the target audio according to the frequency spectrum information of the sample audio and the frequency spectrum information of the target audio, wherein the difference value is second information.
According to a fourth aspect of the embodiments of the present disclosure, there is provided the audio synthesizing apparatus including an acquisition unit and a processing unit. An acquisition unit configured to acquire a feature of target first audio data and a feature of target second audio data; a processing unit configured to input the characteristics of the target first audio data and the characteristics of the target second audio data into an audio synthesis model to obtain a synthesized audio, wherein the audio synthesis model is obtained by using the audio synthesis model generation apparatus according to the third aspect.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute instructions to implement an audio synthesis model generation method as described above in any one of the possible implementations of the first aspect or the first aspect, and/or an audio synthesis method as described above in any one of the possible implementations of the second aspect or the second aspect.
According to a sixth aspect of embodiments of the present disclosure, there is provided a storage medium having instructions stored therein which, when executed by a processor of an electronic device, enable the electronic device to perform the audio synthesis model generation method described in the first aspect or any one of its possible implementations, and/or the audio synthesis method described in the second aspect or any one of its possible implementations.
According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product, in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform a method of generating an audio synthesis model as described in the first aspect or any one of the possible implementations of the first aspect, and/or a method of audio synthesis as described in the second aspect or any one of the possible implementations of the second aspect.
The technical solution provided by the embodiments of the disclosure has at least the following beneficial effects:
With this solution, the features of the first audio data, the features of the second audio data, and the type information and spectrum information of the sample audio are acquired, where the sample audio is obtained by synthesizing the first audio data and the second audio data, the first audio data includes speech audio and speech text, and the second audio data includes singing audio and lyric text. The common features of the first audio data and the second audio data are extracted and merged to obtain a target feature; type recognition and spectrum recognition are performed on the target audio based on the target feature to obtain type information and spectrum information of the target audio; these can be compared with the type information and spectrum information of the sample audio to determine first information and second information; and an audio synthesis model is generated according to the first information and the second information. This ensures that the target audio generated by the resulting audio synthesis model closely matches the sample audio, and improves the realism of the synthesized audio generated by the audio synthesis model.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 is a hardware schematic diagram of an electronic device provided by an embodiment of the present disclosure.
FIG. 2 is a first flowchart illustrating an audio synthesis model generation method according to an exemplary embodiment.
FIG. 3 is a second flowchart illustrating an audio synthesis model generation method according to an exemplary embodiment.
FIG. 4 is a third flowchart illustrating an audio synthesis model generation method according to an exemplary embodiment.
FIG. 5 is a fourth flowchart illustrating an audio synthesis model generation method according to an exemplary embodiment.
FIG. 6 is a first block diagram of an audio synthesis model generation apparatus according to an exemplary embodiment.
FIG. 7 is a second block diagram of an audio synthesis model generation apparatus according to an exemplary embodiment.
FIG. 8 is a block diagram illustrating an audio synthesis apparatus according to an exemplary embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in other sequences than those illustrated or described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terms/nouns referred to in the embodiments of the present disclosure are explained first below.
Speech corpus: composed of speech and the text corresponding to that speech. Generally, a speech corpus is generated in one of two ways: either known text is read aloud to obtain speech, or known speech is recognized to obtain text. It should be understood that obtaining speech from text is text-to-speech (TTS), and obtaining text from speech is automatic speech recognition (ASR). Converting text into speech, or speech into text, can be done by a machine, manually, or by a combination of machine and human effort.
Singing corpus: composed of singing voice and the text corresponding to that singing voice. Generally, a singing corpus is generated in one of two ways: either known text is sung to obtain the singing voice, or a known singing voice is recognized to obtain text.
In the embodiments of the disclosure, the common features of the first audio data and the second audio data are extracted and merged to obtain a target feature; type recognition and spectrum recognition are performed on the target audio based on the target feature to obtain type information and spectrum information of the target audio; these can be compared with the type information and spectrum information of the sample audio to determine first information and second information; and an audio synthesis model is generated according to the first information and the second information. This ensures that the target audio generated by the resulting audio synthesis model closely matches the sample audio, and improves the realism of the synthesized audio generated by the audio synthesis model.
It should be noted that the audio synthesis method provided by the embodiment of the present disclosure may be applied to the following scenes, which are respectively: singing, chat rooms, live rooms, map navigation, etc. Of course, in actual implementation, the audio synthesis method provided by the embodiment of the present disclosure may also be applied to any other possible scenarios, which may be determined according to actual use requirements, and the embodiment of the present disclosure is not limited.
The execution subject of the audio synthesis model generation method provided by the embodiment of the present disclosure may be the audio synthesis model generation apparatus provided by the embodiment of the present disclosure, or may be an electronic device including the audio synthesis model generation apparatus, which may be determined specifically according to actual use requirements, and the embodiment of the present disclosure is not limited.
Fig. 1 is a hardware schematic diagram of an electronic device provided by an embodiment of the present disclosure. The electronic device 100 includes, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted terminal, a wearable device, a pedometer, and the like. As shown in fig. 1, the electronic device includes a processor 101, a memory 102, a network interface 103, and a bus 104. The processor 101, the memory 102 and the network interface 103 may be connected via a bus 104 or may be connected to each other in other manners.
The processor 101 is a control center of the electronic device, and the processor 101 may be a Central Processing Unit (CPU), other general-purpose processors, or the like, wherein the general-purpose processor may be a microprocessor or any conventional processor. Illustratively, the processor 101 may include one or more CPUs. The CPU is a single-core CPU or a multi-core CPU.
Memory 102 includes, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), flash memory, optical memory, magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
In one possible implementation, the memory 102 may exist independently of the processor 101. The memory 102 may be coupled to the processor 101 through the bus 104 for storing data, instructions, or program code. The processor 101 implements the audio synthesis model generation method and/or the audio synthesis method provided by the embodiments of the present disclosure when it calls and executes the instructions or program code stored in the memory 102.
In another possible implementation, the memory 102 may also be integrated with the processor 101.
The network interface 103 is a wired interface (port), such as a Fiber Distributed Data Interface (FDDI) interface or a Gigabit Ethernet (GE) interface. Alternatively, the network interface 103 is a wireless interface. It should be understood that the network interface 103 includes a plurality of physical ports, and that the network interface 103 may be used to receive or transmit voice.
Optionally, the electronic device further comprises an input/output interface 105, wherein the input/output interface 105 is configured to connect with an input device and receive information input by a user through the input device. Input devices include, but are not limited to, a keyboard, a touch screen, a microphone, and the like. The input/output interface 105 is also used for connecting with an output device, and outputting a processing result (for example, a speech corpus is available or unavailable) of the processor 101. Output devices include, but are not limited to, displays, printers, and the like.
The bus 104 may be an industry standard architecture (ISA) bus, a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in FIG. 1, but this does not mean that there is only one bus or one type of bus.
It is noted that the configuration shown in fig. 1 does not constitute a limitation of the electronic device, which may comprise more or less components than those shown in fig. 1, or a combination of certain components, or a different arrangement of components, in addition to the components shown in fig. 1.
The following takes an audio synthesis model generation apparatus as an example, and with reference to each drawing, an audio synthesis model generation method provided by the embodiment of the present disclosure is exemplarily described.
Fig. 2 is a flowchart illustrating an audio synthesis model generation method according to an exemplary embodiment. The method is used in an audio synthesis model generation apparatus and, as illustrated in Fig. 2, includes the following steps S21-S25.
In step S21, the audio synthesis model generation means acquires the feature of the first audio data, the feature of the second audio data, the type information of the sample audio, and the spectral information.
Specifically, the sample audio is obtained by synthesizing first audio data and second audio data; the first audio data includes voice audio and voice text, and the second audio data includes singing audio and lyric text.
In embodiments of the present disclosure, the first audio data may be referred to as a speech corpus, including speech audio and speech text; these may be recorded speech audio and its speech text. For example, for the speech corpus "The weather is really nice today!", the sentence "The weather is really nice today!" is the speech text, and the speech audio is the audio file in which that sentence is spoken. The second audio data may be referred to as a singing corpus, including singing audio and lyric text; these may be recorded singing audio and its lyric text. For example, for the singing corpus "The weather is really nice today!", that sentence is the lyric text, and the singing audio is the audio file in which it is sung. It should be noted that the text of a speech corpus or a singing corpus may contain only words, and may also include one or more of symbols, numbers, and the like, for example "You clap once, I clap once" or "One Thousand and One Nights". In addition, the first audio data and the second audio data may come from the same user or from different users, which may be determined according to the actual situation.
In step S22, the audio synthesis model generation device performs feature merging based on the features of the first audio data and the features of the second audio data to obtain a target feature.
Specifically, the target feature is used to characterize a target audio synthesized from the first audio data and the second audio data.
In the embodiment of the present disclosure, the audio synthesis model generation apparatus may extract common features of the first audio data and the second audio data to perform feature merging, so as to obtain a target feature, where the target feature is used to characterize a feature of a target audio synthesized by the first audio data and the second audio data, and may be a sound feature and/or an audio signal feature. The sound features may include fundamental frequencies, syllables, phonemes, and the like, the audio signal features may include time-domain characteristics, which may include short-term energy, short-term amplitude, short-term zero-crossing rate, and the like, and/or frequency-domain characteristics, which may include audio signal classification, frequency spectrum, power spectral density, energy spectral density, and the like.
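As an illustrative aside (not part of the disclosure), the time-domain characteristics mentioned above can be computed per analysis frame. The following Python sketch shows one possible way to obtain short-term energy and short-term zero-crossing rate; the frame length, hop size, and function name are arbitrary choices made for illustration.

import numpy as np

def short_term_features(signal, frame_len=1024, hop=512):
    """Per-frame short-term energy and zero-crossing rate of a mono signal."""
    energies, zcrs = [], []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # Short-term energy: sum of squared samples within the frame.
        energies.append(float(np.sum(frame ** 2)))
        # Zero-crossing rate: fraction of adjacent sample pairs whose sign changes.
        zcrs.append(float(np.mean(np.abs(np.diff(np.sign(frame))) > 0)))
    return np.array(energies), np.array(zcrs)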
Illustratively, in conjunction with Fig. 2, and as shown in Fig. 3 and Fig. 4, acquiring the features of the first audio data in step S21 may be specifically realized through steps S201 to S203 described below, and acquiring the features of the second audio data in step S21 may be specifically realized through steps S204 to S206 described below.
In step S201, the audio synthesis model generation device performs phoneme recognition on the first audio data to obtain a phoneme feature of the first audio data.
In step S202, the audio synthesis model generation device performs fundamental frequency identification on the first audio data to obtain fundamental frequency features of the first audio data.
In step S203, the audio synthesis model generating device concatenates the phoneme feature of the first audio data and the fundamental frequency feature of the first audio data to obtain the feature of the first audio data.
In step S204, the audio synthesis model generation device performs phoneme recognition on the second audio data to obtain a phoneme feature of the second audio data.
In step S205, the audio synthesis model generating device performs fundamental frequency identification on the second audio data to obtain fundamental frequency features of the second audio data.
In step S206, the audio synthesis model generation device concatenates the phoneme feature of the second audio data and the fundamental frequency feature of the second audio data to obtain the feature of the second audio data.
In this embodiment of the disclosure, the first audio data and the second audio data may each be preprocessed to obtain the speech audio of the speech corpus together with its corresponding speech text, and the singing audio of the singing corpus together with its corresponding lyric text. A phoneme sequence is extracted from the speech text and the lyric text through text analysis, and the phoneme sequence is then converted into phoneme feature vectors. The speech and the singing voice are framed and windowed, fundamental frequency identification is performed to obtain fundamental frequency information, and the fundamental frequency information is converted into fundamental frequency feature vectors. The specific processing used for the phoneme recognition and the fundamental frequency recognition is not limited herein.
For example, the phoneme sequence P = (j in t ian t ian q i zh en h ao) of "The weather is really nice today" is extracted, and the fundamental frequency information M = MIDI note number 69 is obtained; the features identified in the first audio data and the second audio data may then be encoded and spliced using one-hot encoding, target encoding, leave-one-out encoding, or the like, and used as the input of the feature extraction network.
In the embodiment of the disclosure, the first audio data and the second audio data are preprocessed, so that the feature extraction network can extract the target feature conveniently.
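To make the preprocessing above concrete, the following Python sketch builds the input to the feature extraction network by concatenating a one-hot phoneme encoding with a fundamental-frequency value, using the example "jin tian tian qi zhen hao" with MIDI note number 69. The phoneme inventory, the normalization of the MIDI value, and the function names are hypothetical and only for illustration.

import numpy as np

# Hypothetical phoneme inventory; the real inventory depends on the corpus.
PHONEMES = ["j", "in", "t", "ian", "q", "i", "zh", "en", "h", "ao"]

def one_hot(symbol, inventory):
    vec = np.zeros(len(inventory), dtype=np.float32)
    vec[inventory.index(symbol)] = 1.0
    return vec

def build_input_features(phoneme_seq, midi_pitch):
    """Concatenate a one-hot phoneme vector with a normalized fundamental frequency value."""
    f0 = np.float32(midi_pitch / 127.0)  # crude normalization, for illustration only
    frames = [np.concatenate([one_hot(p, PHONEMES), [f0]]) for p in phoneme_seq]
    return np.stack(frames)  # shape: (sequence length, number of phonemes + 1)

# Example corresponding to P = (j in t ian t ian q i zh en h ao) and M = MIDI 69.
features = build_input_features(["j", "in", "t", "ian", "t", "ian", "q", "i", "zh", "en", "h", "ao"], 69)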
It should be noted that the embodiments of the present disclosure do not limit the execution order between steps S201 and S202, between steps S204 and S205, or between the steps for the first audio data (S201 to S203) and the steps for the second audio data (S204 to S206). For example, step S201 may be performed before step S202, step S202 may be performed before step S201, or steps S201 and S202 may be performed at the same time, which may be determined according to actual usage requirements.
In the embodiments of the disclosure, phoneme recognition and fundamental frequency recognition are performed on the first audio data and on the second audio data to obtain their common features, namely the phoneme features and the fundamental frequency features. Since the target feature is obtained by splicing the common features of the first audio data and the second audio data, and can therefore represent the common features of the target audio synthesized from them, comparing the type information and the spectrum information of the target audio with those of the sample audio is made easier.
In step S23, the audio synthesis model generation device performs type recognition and spectrum recognition on the target audio based on the target feature, and obtains type information and spectrum information of the target audio, respectively.
In the embodiment of the present disclosure, the target feature may be input into a type discrimination network to perform type recognition on the target audio and, at the same time, into a spectrum discrimination network to perform spectrum recognition on the target audio, so as to obtain the type information and the spectrum information of the target audio, respectively. The type information may be used for domain labeling, i.e. labeling whether the target audio originates from the speech corpus or the singing corpus, where the speech corpus serves as the source domain and the singing corpus serves as the target domain. The spectrum information may be a Mel spectrum from which the target audio can be decoded.
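The disclosure does not fix a concrete architecture for these two networks (a later passage only notes that CNNs or DNNs may be used). As a minimal sketch, and assuming PyTorch, the two discrimination heads operating on the merged target feature could look as follows; FEAT_DIM, N_MEL_CLASSES, and the layer sizes are assumptions made for illustration.

import torch
import torch.nn as nn

FEAT_DIM = 256       # dimensionality of the merged target feature (assumed)
N_MEL_CLASSES = 80   # the K preset Mel-spectrum classes (value assumed)

class TypeDiscriminator(nn.Module):
    """Type discrimination network: probability that a target feature comes from
    the speech corpus (source domain) rather than the singing corpus (target domain)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(FEAT_DIM, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, feat):
        return torch.sigmoid(self.net(feat))

class SpectrumDiscriminator(nn.Module):
    """Spectrum discrimination network: logits over the K Mel-spectrum classes;
    applying softmax gives the per-class probabilities."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(FEAT_DIM, 128), nn.ReLU(), nn.Linear(128, N_MEL_CLASSES))

    def forward(self, feat):
        return self.net(feat)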
In step S24, the audio synthesis model generation means determines the first information from the type information of the sample audio and the type information of the target audio, and determines the second information from the spectrum information of the sample audio and the spectrum information of the target audio.
Specifically, the first information is used to represent a reverse difference between type information of the sample audio and type information of the target audio, and the second information is used to represent a difference between spectral information of the sample audio and spectral information of the target audio.
In the embodiment of the present disclosure, in combination with fig. 2 described above, as shown in fig. 5, "determine the first information according to the type information of the sample audio and the type information of the target audio" in step S24 described above may be specifically implemented by steps S211 to S212 described below.
In step S211, the audio synthesis model generation means calculates a difference value between the type information of the sample audio and the type information of the target audio, based on the type information of the sample audio and the type information of the target audio.
In step S212, the audio synthesis model generating device obtains the first information according to the disparity value and the back propagation algorithm.
Illustratively, E represents the feature extraction network, E(xi) represents the feature of a target audio xi extracted by the feature extraction network, and D represents the type discrimination network. From the output of D(E(xi)), h(D(E(xi))) represents the probability that the target audio xi comes from the source domain, and 1 - h(D(E(xi))) represents the probability that it comes from the target domain. The difference value between the domain label of the target audio and the domain label of the sample audio can then be calculated by formula (1):
L_d = -Σ_{xi∈Xs} log h(D(E(xi))) - Σ_{xi∈Xt} log(1 - h(D(E(xi))))    formula (1)
where L_d represents the difference value function, Xs represents the sample set of the source domain, and Xt represents the sample set of the target domain.
In the embodiment of the present disclosure, the difference value is calculated according to formula (1) above, and the difference value may then be negated through the back-propagation algorithm, for example by multiplying it by the coefficient "-λ" in a gradient reversal layer, so as to obtain the first information. In this way, the feature extraction network and the type discrimination network perform adversarial learning: the type discrimination network tries to distinguish the audio of the speech corpus from the audio of the singing corpus as well as possible, while the feature extraction network extracts domain-invariant features that confuse the type discrimination network and cause it to misjudge, that is, the type discrimination network can no longer tell whether the target audio comes from the speech corpus or the singing corpus.
Of course, in actual implementation, the above coefficients may also include any other possible coefficients, which may be determined according to actual usage requirements, and the embodiment of the present disclosure is not limited.
In the embodiments of the disclosure, the first information is obtained by calculating the difference between the type information of the sample audio and that of the target audio and performing adversarial learning through the back-propagation algorithm; the first information confuses the type discrimination of the target audio, so that the type of the target audio becomes closer to that of the sample audio.
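The gradient reversal described above (identity in the forward pass, multiplication of the gradient by "-λ" in the backward pass) is commonly implemented as a small custom autograd operation. The following sketch assumes PyTorch; it is an illustration of the technique, not code taken from the disclosure.

import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (negate and scale) the gradient flowing back into the feature extraction network.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)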
In this embodiment of the disclosure, with reference to fig. 2, in the step S24, "determine the second information according to the spectrum information of the sample audio and the spectrum information of the target audio", specifically, a difference value between the spectrum information of the sample audio and the spectrum information of the target audio may be calculated according to the spectrum information of the sample audio and the spectrum information of the target audio, where the difference value is the second information.
Illustratively, E represents the feature extraction network, E(xi) represents the feature of a target audio xi extracted by the feature extraction network, and Y represents the spectrum discrimination network. K classes of Mel spectra are preset, and the output of the spectrum discrimination network is converted into the probability P_k(xi) that the target audio belongs to the k-th class of Mel spectrum. After the probability of the target audio for each class is calculated, the Mel spectrum with the maximum prediction probability for the target audio xi can be obtained through formula (2):
ŷi = argmax_k P_k(xi)    formula (2)
where ŷi is the Mel-spectrum prediction label of the target audio xi. For a sample audio with a Mel-spectrum label, the difference between the Mel spectrum of the target audio and the Mel spectrum of the sample audio can be calculated by formula (3) and formula (4):
L_y(Xs, Ys) = Σ_{(xi, yi)∈(Xs, Ys)} H(P(xi), yi)    formula (3)
H(P(xi), yi) = -yi log(P(xi)) - (1 - yi) log(1 - P(xi))    formula (4)
where L_y represents the difference value function for the spectrum information, (Xs, Ys) represents the distribution of the speech corpora and their Mel spectra in the source domain, H represents the cross-entropy function, P(xi) represents the probabilities that the target audio corresponds to the respective Mel spectra, xi represents a speech corpus in the source domain, and yi represents the class label of its Mel spectrum.
In the embodiment of the present disclosure, the difference value is calculated according to formulas (3) and (4) above, so as to obtain the second information. In this way, the feature extraction network and the spectrum discrimination network learn the spectral difference: the spectrum discrimination network distinguishes the audio of the speech corpus from the audio of the singing corpus as well as possible, the feature extraction network captures the spectral difference between the two, and the spectrum information of the target audio can be distinguished from the spectrum information of the sample audio.
In the embodiments of the disclosure, the second information is obtained by calculating the difference between the spectrum information of the sample audio and the spectrum information of the target audio and learning it through the back-propagation algorithm; the second information makes the discrimination of the spectrum information of the target audio more accurate.
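For concreteness, the following numpy sketch shows how formulas (2) to (4) could be evaluated for one batch of predictions. Treating the label as a one-hot vector and summing (rather than averaging) over the samples are assumptions made here for illustration.

import numpy as np

def mel_prediction_label(probs):
    """Formula (2): index of the Mel-spectrum class with the maximum probability."""
    return int(np.argmax(probs))

def cross_entropy(probs, onehot_label, eps=1e-8):
    """Formula (4): H(P(xi), yi), accumulated over the K classes."""
    p = np.clip(probs, eps, 1.0 - eps)
    return float(np.sum(-onehot_label * np.log(p) - (1.0 - onehot_label) * np.log(1.0 - p)))

def spectrum_difference(batch_probs, batch_onehot_labels):
    """Formula (3): total cross-entropy over labelled source-domain samples."""
    return float(sum(cross_entropy(p, y) for p, y in zip(batch_probs, batch_onehot_labels)))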
In step S25, the audio synthesis model generation means generates an audio synthesis model from the first information and the second information.
In the embodiment of the present disclosure, after the audio synthesis model generation apparatus inputs the target audio into the type discrimination network and the spectrum discrimination network, the first information and the second information may be obtained. If the first information is less than or equal to a first preset threshold and the second information is less than or equal to a second preset threshold, the audio synthesis model generation apparatus may determine that the current feature extraction network meets the condition. If the first information is greater than the first preset threshold, or the second information is greater than the second preset threshold, the audio synthesis model generation apparatus may update the parameters of the feature extraction network, and input the first audio data and the second audio data into the feature extraction network with updated parameters again, until the first information of the target audio generated by the feature extraction network relative to the sample audio is less than or equal to the first preset threshold and the second information is less than or equal to the second preset threshold.
In the embodiment of the present disclosure, the audio synthesis model generation means updates the parameter of the feature extraction network in the audio synthesis model in the case where the first information is minimum and the second information is minimum.
For example, the optimization goal of the feature extraction network E is to minimize the first information, and the optimization goal of the type discrimination network D is to maximize the first information, as shown in formula (5):
(θ̂_E, θ̂_D) = arg min_{θ_E} max_{θ_D} ( -λ · L_d(θ_E, θ_D) )    formula (5)
where θ_E represents the parameters of the feature extraction network, θ_D represents the parameters of the type discrimination network, and (θ̂_E, θ̂_D) represent the optimized parameters of the feature extraction network and the type discrimination network, i.e. the feature extraction network is optimized so that the first information is minimized, and the type discrimination network is optimized so that the first information is maximized.
The optimization goal of the spectrum discrimination network Y is to minimize the second information, as shown in formula (6):
(θ̂_E, θ̂_Y) = arg min_{θ_E, θ_Y} L_y(θ_E, θ_Y)    formula (6)
where θ_E represents the parameters of the feature extraction network, θ_Y represents the parameters of the spectrum discrimination network, and (θ̂_E, θ̂_Y) represent the optimized parameters of the feature extraction network and the spectrum discrimination network, chosen so that the second information is minimized.
As can be seen from the above, the minimum value of the first information and the minimum value of the second information can be obtained through formulas (5) and (6), and the parameters of the feature extraction network are updated accordingly, so that the audio synthesis model is optimized and the synthesized audio obtained from it is more realistic.
Optionally, in this embodiment of the present disclosure, the audio synthesis model generating device may include a feature extraction network, a type discrimination network, and a spectrum discrimination network, where the feature extraction network, the type discrimination network, and the spectrum discrimination network may be a Convolutional Neural Network (CNN) or a Deep Neural Network (DNN).
Optionally, during training of the feature extraction network, the audio synthesis model generation device may first update the parameters of the feature extraction network and then update the parameters of the type discrimination network and the spectrum discrimination network; alternatively, it may first update the parameters of the type discrimination network and the spectrum discrimination network and then update the parameters of the feature extraction network. This may be determined according to actual usage requirements, and the embodiments of the disclosure are not limited in this respect.
In the embodiment of the present disclosure, it is required to ensure that the target audio is similar to the sample audio, and therefore, in the training process, parameters of the type discrimination network and the spectrum discrimination network may be fixed, and only parameters of the feature extraction network are optimized, so that the first information is minimum and the second information is minimum.
In the embodiment of the present disclosure, since the objective of the type discrimination network is to minimize the loss of the domain labeling, in the training process, the parameters of the feature extraction network may be fixed, and only the parameters of the type discrimination network are optimized, so that the loss of the domain labeling is minimized. The aim of the spectrum discrimination network is to minimize the loss of spectrum information, so that in the training process, the parameters of the feature extraction network can be fixed, and only the parameters of the spectrum discrimination network are optimized, so that the loss of spectrum information is minimized.
That is, in the training process, the audio synthesis model generation device may update the parameters of the feature extraction network, the type discrimination network, and the spectrum discrimination network in turn, so as to obtain an audio synthesis model that can generate relatively realistic synthesized audio.
In the embodiment of the present disclosure, since the audio synthesis model is obtained by alternately optimizing the feature extraction network, the type discrimination network, and the spectrum discrimination network, the synthesized audio generated by the audio synthesis model trained in the embodiments of the present disclosure is clear and realistic.
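The alternating scheme described above can be sketched as a single training step. The sketch below assumes PyTorch modules E (feature extraction), D (type discrimination) and Y (spectrum discrimination) with separate optimizers created elsewhere; the exact loss forms, the single λ coefficient, and the batch handling are illustrative assumptions rather than the disclosure's definitive procedure.

import torch
import torch.nn.functional as F

def train_step(E, D, Y, opt_E, opt_D, opt_Y, x_src, x_tgt, mel_labels, lambd=1.0):
    """One alternating update; x_src / x_tgt are feature batches from the speech
    corpus / singing corpus, mel_labels are Mel-spectrum class labels of x_src."""
    # (a) Fix the feature extraction network, update the two discrimination networks.
    with torch.no_grad():
        f_src, f_tgt = E(x_src), E(x_tgt)
    p_src, p_tgt = D(f_src), D(f_tgt)
    d_loss = F.binary_cross_entropy(p_src, torch.ones_like(p_src)) + \
             F.binary_cross_entropy(p_tgt, torch.zeros_like(p_tgt))
    y_loss = F.cross_entropy(Y(f_src), mel_labels)
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()
    opt_Y.zero_grad(); y_loss.backward(); opt_Y.step()

    # (b) Fix the discrimination networks, update the feature extraction network so that
    # the spectrum loss stays small while the type discrimination is confused
    # (the "-lambda" gradient reversal expressed here as a negated loss term).
    f_src, f_tgt = E(x_src), E(x_tgt)
    p_src, p_tgt = D(f_src), D(f_tgt)
    d_conf = F.binary_cross_entropy(p_src, torch.ones_like(p_src)) + \
             F.binary_cross_entropy(p_tgt, torch.zeros_like(p_tgt))
    e_loss = F.cross_entropy(Y(f_src), mel_labels) - lambd * d_conf
    opt_E.zero_grad(); e_loss.backward(); opt_E.step()
    return float(d_loss), float(y_loss), float(e_loss)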
It should be noted that, in the embodiments of the present disclosure, the audio synthesis model generation methods shown in the above drawings are all exemplarily described with reference to one drawing in the embodiments of the present disclosure. In specific implementation, the audio synthesis model generation method shown in each of the above figures may also be implemented in combination with any other combinable figure shown in the above embodiments, and details are not described here again.
FIG. 6 is a block diagram illustrating an audio synthesis model generation apparatus according to an example embodiment. Referring to fig. 6, the audio synthesis model generation apparatus 60 includes an acquisition module 61, a feature extraction module 62, a first processing module 63, a second processing module 64, and a generation module 65. An obtaining module 61 configured to perform obtaining the characteristics of the first audio data, the characteristics of the second audio data, the type information of the sample audio, and the spectrum information; the sample audio is obtained by synthesizing first audio data and second audio data; the first audio data includes voice audio and voice text, and the second audio data includes singing audio and lyric text; the feature extraction module 62 is configured to perform feature merging based on features of the first audio data and features of the second audio data to obtain a target feature, where the target feature is used to characterize a feature of a target audio synthesized by the first audio data and the second audio data, and the first processing module 63 is configured to perform type identification and spectrum identification of the target audio based on the target feature to obtain type information and spectrum information of the target audio, respectively; the second processing module 64 is configured to perform determining the first information according to the type information of the sample audio and the type information of the target audio, and determining the second information according to the spectrum information of the sample audio and the spectrum information of the target audio; the first information is used for representing the reverse difference between the type information of the sample audio and the type information of the target audio, and the second information is used for representing the difference between the spectrum information of the sample audio and the spectrum information of the target audio; a generating module 65 configured to perform generating an audio synthesis model from the first information and the second information.
Optionally, in this embodiment of the present disclosure, the obtaining module 61 is configured to perform the feature of obtaining the first audio data, and specifically includes: performing phoneme recognition on the first audio data to obtain phoneme characteristics of the first audio data; carrying out fundamental frequency identification on the first audio data to obtain fundamental frequency characteristics of the first audio data; and splicing the phoneme characteristics of the first audio data and the fundamental frequency characteristics of the first audio data to obtain the characteristics of the first audio data.
Optionally, in this embodiment of the present disclosure, the obtaining module 61 is configured to perform the feature of obtaining the second audio data, and specifically includes: performing phoneme recognition on the second audio data to obtain phoneme characteristics of the second audio data; carrying out fundamental frequency identification on the second audio data to obtain fundamental frequency characteristics of the second audio data; and splicing the phoneme characteristics of the second audio data and the fundamental frequency characteristics of the second audio data to obtain the characteristics of the second audio data.
Optionally, in this embodiment of the present disclosure, the second processing module 64 is configured to determine the first information according to the type information of the sample audio and the type information of the target audio, and specifically includes: calculating a difference value between the type information of the sample audio and the type information of the target audio according to the type information of the sample audio and the type information of the target audio; and obtaining first information according to the difference value and a back propagation algorithm.
Optionally, in this embodiment of the present disclosure, the second processing module 64 is configured to determine the second information according to the spectrum information of the sample audio and the spectrum information of the target audio, and specifically includes: and calculating a difference value between the frequency spectrum information of the sample audio and the frequency spectrum information of the target audio according to the frequency spectrum information of the sample audio and the frequency spectrum information of the target audio, wherein the difference value is second information.
Optionally, in this embodiment of the present disclosure, referring to Fig. 7, an updating module 66 is further included, where the updating module 66 is configured to update the parameters of the feature extraction network in the audio synthesis model in the case where the first information is minimum and the second information is minimum.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
In the embodiment of the disclosure, the characteristics of the first audio data, the characteristics of the second audio data, the type information and the spectrum information of the sample audio are obtained; the method comprises the steps of carrying out feature combination on the basis of the features of first audio data and the features of second audio data to obtain target features, carrying out type identification and spectrum identification on target audio on the basis of the target features to respectively obtain type information and spectrum information of the target audio, further determining the first information and the second information, and generating an audio synthesis model according to the first information and the second information, so that the truth degree of the target audio generated by the obtained audio synthesis model relative to sample audio is ensured, and the authenticity of the audio synthesis model for generating synthetic audio is improved.
Fig. 8 is a block diagram illustrating an audio synthesis apparatus according to an exemplary embodiment. Referring to fig. 8, the audio synthesis apparatus 80 includes an acquisition unit 81 and a processing unit 82. The acquisition unit 81 is configured to acquire the features of the target first audio data and the features of the target second audio data; the processing unit 82 is configured to input the features of the target first audio data and the features of the target second audio data into an audio synthesis model to obtain a synthesized audio, where the audio synthesis model is obtained by the audio synthesis model generation apparatus described with reference to figs. 6 and 7.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
The embodiment of the present disclosure provides an audio synthesis apparatus in which the audio synthesis model extracts features common to the first audio data and the second audio data, so that the synthesized audio generated by the audio synthesis model has high authenticity; thus, inputting the features of the first audio data and the second audio data into the audio synthesis model yields a relatively realistic synthesized audio.
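For illustration only, inference with a trained audio synthesis model could look like the following sketch; `synthesis_model`, `speech_features`, and `singing_features` are hypothetical placeholders, and the patent does not prescribe their shapes or interfaces.

```python
import torch

@torch.no_grad()
def synthesize(synthesis_model: torch.nn.Module,
               speech_features: torch.Tensor,    # features of the target first audio data
               singing_features: torch.Tensor    # features of the target second audio data
               ) -> torch.Tensor:
    synthesis_model.eval()
    # The model merges the two feature sets into target features and produces
    # the synthesized audio (or its spectrogram) from them.
    return synthesis_model(speech_features, singing_features)
```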
Another embodiment of the present application further provides a computer-readable storage medium storing computer instructions which, when run on an audio synthesis model generation apparatus, cause the audio synthesis model generation apparatus to execute the steps of the audio synthesis model generation method in the method flow shown in the above method embodiments, or, when run on an audio synthesis apparatus, cause the audio synthesis apparatus to execute the steps of the audio synthesis method in the method flow shown in the above method embodiments.
Another embodiment of the present application further provides a chip system, which may be applied to an audio synthesis model generation apparatus or an audio synthesis apparatus. The chip system includes one or more interface circuits and one or more processors. The interface circuits and the processors are interconnected by lines. The interface circuits are configured to receive signals from the memory of the apparatus and send the signals to the processors, the signals including computer instructions stored in the memory. When the processors execute the computer instructions, in the case that the chip system is applied to an audio synthesis model generation apparatus, the audio synthesis model generation apparatus performs the steps of the audio synthesis model generation method in the method flow shown in the above method embodiments; or, in the case that the chip system is applied to an audio synthesis apparatus, the audio synthesis apparatus performs the steps of the audio synthesis method in the method flow shown in the above method embodiments.
In another embodiment of the present application, there is also provided a computer program product including computer instructions which, when run on an audio synthesis model generation apparatus, cause the audio synthesis model generation apparatus to perform the steps of the audio synthesis model generation method in the method flow shown in the above method embodiments, or, when run on an audio synthesis apparatus, cause the audio synthesis apparatus to perform the steps of the audio synthesis method in the method flow shown in the above method embodiments.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented using a software program, the implementation may take the form, in whole or in part, of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a Solid State Disk (SSD)), among others.
The foregoing is only illustrative of the present application. Those skilled in the art should appreciate that changes and substitutions can be made in the embodiments provided herein without departing from the scope of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (14)

1. A method for generating an audio synthesis model, comprising:
acquiring the characteristics of the first audio data, the characteristics of the second audio data, the type information and the spectrum information of the sample audio; the sample audio is obtained by synthesizing the first audio data and the second audio data; the first audio data comprises voice audio and voice text, and the second audio data comprises singing audio and lyric text;
combining the characteristics of the first audio data and the characteristics of the second audio data to obtain target characteristics, wherein the target characteristics are used for representing the characteristics of a target audio synthesized by the first audio data and the second audio data;
performing type identification and spectrum identification on the target audio based on the target characteristics to respectively obtain type information and spectrum information of the target audio;
determining first information according to the type information of the sample audio and the type information of the target audio, and determining second information according to the frequency spectrum information of the sample audio and the frequency spectrum information of the target audio; the second information is used for representing the difference between the spectral information of the sample audio and the spectral information of the target audio; the determining first information according to the type information of the sample audio and the type information of the target audio includes: calculating a difference value between the type information of the sample audio and the type information of the target audio according to the type information of the sample audio and the type information of the target audio; obtaining the first information according to the difference value and a back propagation algorithm;
and generating an audio synthesis model according to the first information and the second information.
2. The method of claim 1, wherein obtaining the characteristic of the first audio data comprises:
performing phoneme recognition on the first audio data to obtain phoneme characteristics of the first audio data;
performing fundamental frequency identification on the first audio data to obtain fundamental frequency characteristics of the first audio data;
and splicing the phoneme characteristics of the first audio data and the fundamental frequency characteristics of the first audio data to obtain the characteristics of the first audio data.
3. The method of claim 1, wherein obtaining the characteristics of the second audio data comprises:
performing phoneme recognition on the second audio data to obtain phoneme characteristics of the second audio data;
performing fundamental frequency identification on the second audio data to obtain fundamental frequency characteristics of the second audio data;
and splicing the phoneme characteristics of the second audio data and the fundamental frequency characteristics of the second audio data to obtain the characteristics of the second audio data.
4. The method of claim 1, further comprising:
updating parameters of a feature extraction network in the audio synthesis model if the first information is minimal and the second information is minimal.
5. The method of claim 1, wherein determining second information according to the spectral information of the sample audio and the spectral information of the target audio comprises:
and calculating a difference value between the frequency spectrum information of the sample audio and the frequency spectrum information of the target audio according to the frequency spectrum information of the sample audio and the frequency spectrum information of the target audio, wherein the difference value is the second information.
6. An audio synthesis method, comprising:
acquiring the characteristics of target first audio data and the characteristics of target second audio data;
inputting the characteristics of the target first audio data and the characteristics of the target second audio data into an audio synthesis model to obtain a synthesized audio; wherein the audio synthesis model is a model obtained by the audio synthesis model generation method according to any one of claims 1 to 5.
7. An audio synthesis model generation apparatus, comprising:
an acquisition module configured to perform acquisition of a feature of the first audio data, a feature of the second audio data, type information and spectrum information of the sample audio; the sample audio is obtained by synthesizing the first audio data and the second audio data; the first audio data comprises voice audio and voice text, and the second audio data comprises singing audio and lyric text;
the feature extraction module is configured to perform feature merging based on features of the first audio data and features of the second audio data to obtain target features, and the target features are used for representing features of target audio synthesized by the first audio data and the second audio data;
the first processing module is configured to perform type identification and spectrum identification on the target audio based on the target features, and obtain type information and spectrum information of the target audio respectively;
a second processing module configured to determine first information according to the type information of the sample audio and the type information of the target audio, and determine second information according to the spectrum information of the sample audio and the spectrum information of the target audio; the second information is used for characterizing the difference between the spectral information of the sample audio and the spectral information of the target audio; the second processing module is configured to determine the first information according to the type information of the sample audio and the type information of the target audio, and specifically includes: calculating a difference value between the type information of the sample audio and the type information of the target audio according to the type information of the sample audio and the type information of the target audio; obtaining the first information according to the difference value and a back propagation algorithm;
a generating module configured to perform generating an audio synthesis model from the first information and the second information.
8. The apparatus of claim 7, wherein the obtaining module is configured to perform obtaining the characteristic of the first audio data, and in particular comprises:
performing phoneme recognition on the first audio data to obtain phoneme characteristics of the first audio data;
performing fundamental frequency identification on the first audio data to obtain fundamental frequency characteristics of the first audio data;
and splicing the phoneme characteristics of the first audio data and the fundamental frequency characteristics of the first audio data to obtain the characteristics of the first audio data.
9. The apparatus of claim 7, wherein the obtaining module is configured to perform obtaining the feature of the second audio data, and specifically comprises:
performing phoneme recognition on the second audio data to obtain phoneme characteristics of the second audio data;
performing fundamental frequency identification on the second audio data to obtain fundamental frequency characteristics of the second audio data;
and splicing the phoneme characteristics of the second audio data and the fundamental frequency characteristics of the second audio data to obtain the characteristics of the second audio data.
10. The apparatus of claim 7, further comprising an update module,
the update module is configured to perform updating parameters of a feature extraction network in the audio synthesis model if the first information is minimal and the second information is minimal.
11. The apparatus according to claim 7, wherein the second processing module is configured to perform the determining the second information according to the spectral information of the sample audio and the spectral information of the target audio, and specifically includes:
and calculating a difference value between the frequency spectrum information of the sample audio and the frequency spectrum information of the target audio according to the frequency spectrum information of the sample audio and the frequency spectrum information of the target audio, wherein the difference value is the second information.
12. An audio synthesizing apparatus, comprising:
an acquisition unit configured to perform acquisition of a feature of target first audio data and a feature of target second audio data;
a processing unit configured to perform inputting the characteristics of the target first audio data and the characteristics of the target second audio data into an audio synthesis model to obtain a synthesized audio;
wherein the audio synthesis model is a model obtained by using the audio synthesis model generation apparatus according to any one of claims 7 to 11.
13. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of any one of claims 1 to 6.
14. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, cause the electronic device to perform the method of any of claims 1-6.
CN202110438286.1A 2021-04-22 2021-04-22 Audio synthesis model generation method and device and audio synthesis method and device Active CN113192522B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110438286.1A CN113192522B (en) 2021-04-22 2021-04-22 Audio synthesis model generation method and device and audio synthesis method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110438286.1A CN113192522B (en) 2021-04-22 2021-04-22 Audio synthesis model generation method and device and audio synthesis method and device

Publications (2)

Publication Number Publication Date
CN113192522A CN113192522A (en) 2021-07-30
CN113192522B true CN113192522B (en) 2023-02-21

Family

ID=76978589

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110438286.1A Active CN113192522B (en) 2021-04-22 2021-04-22 Audio synthesis model generation method and device and audio synthesis method and device

Country Status (1)

Country Link
CN (1) CN113192522B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101312038A (en) * 2007-05-25 2008-11-26 摩托罗拉公司 Method for synthesizing voice
CN108630190A (en) * 2018-05-18 2018-10-09 百度在线网络技术(北京)有限公司 Method and apparatus for generating phonetic synthesis model
CN109326280A (en) * 2017-07-31 2019-02-12 科大讯飞股份有限公司 One kind singing synthetic method and device, electronic equipment
CN111402857A (en) * 2020-05-09 2020-07-10 广州虎牙科技有限公司 Speech synthesis model training method and device, electronic equipment and storage medium
CN111445892A (en) * 2020-03-23 2020-07-24 北京字节跳动网络技术有限公司 Song generation method and device, readable medium and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5457706B2 (en) * 2009-03-30 2014-04-02 株式会社東芝 Speech model generation device, speech synthesis device, speech model generation program, speech synthesis program, speech model generation method, and speech synthesis method


Also Published As

Publication number Publication date
CN113192522A (en) 2021-07-30

Similar Documents

Publication Publication Date Title
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
CN109817213B (en) Method, device and equipment for performing voice recognition on self-adaptive language
CN107767869B (en) Method and apparatus for providing voice service
JP5768093B2 (en) Speech processing system
CN111445892B (en) Song generation method and device, readable medium and electronic equipment
WO2020036178A1 (en) Voice conversion learning device, voice conversion device, method, and program
CN111798821B (en) Sound conversion method, device, readable storage medium and electronic equipment
CN112786008B (en) Speech synthesis method and device, readable medium and electronic equipment
CN106295717A (en) A kind of western musical instrument sorting technique based on rarefaction representation and machine learning
WO2023245389A1 (en) Song generation method, apparatus, electronic device, and storage medium
CN110164460A (en) Sing synthetic method and device
WO2022156413A1 (en) Speech style migration method and apparatus, readable medium and electronic device
CN110827799B (en) Method, apparatus, device and medium for processing voice signal
CN112802446B (en) Audio synthesis method and device, electronic equipment and computer readable storage medium
WO2019001458A1 (en) Method and device for determining emotion information
CN111599339B (en) Speech splicing synthesis method, system, equipment and medium with high naturalness
CN113192522B (en) Audio synthesis model generation method and device and audio synthesis method and device
CN115132170A (en) Language classification method and device and computer readable storage medium
CN112735432B (en) Audio identification method, device, electronic equipment and storage medium
CN112037772A (en) Multi-mode-based response obligation detection method, system and device
CN115910033B (en) Speech synthesis method and device, electronic equipment and readable storage medium
WO2023051155A1 (en) Voice processing and training methods and electronic device
CN117392986B (en) Voiceprint processing method, voiceprint processing apparatus, voiceprint processing device, voiceprint processing program product, and storage medium
Doungpaisan et al. Query by Example of Speaker Audio Signals using Power Spectrum and MFCCs
CN113674735B (en) Sound conversion method, device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant