CN114974218A - Voice conversion model training method and device and voice conversion method and device - Google Patents

Voice conversion model training method and device and voice conversion method and device

Info

Publication number
CN114974218A
CN114974218A
Authority
CN
China
Prior art keywords
speaker
characteristic
calculating
content
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210554179.XA
Other languages
Chinese (zh)
Inventor
盛乐园
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Xiaoying Innovation Technology Co ltd
Original Assignee
Hangzhou Xiaoying Innovation Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Xiaoying Innovation Technology Co ltd filed Critical Hangzhou Xiaoying Innovation Technology Co ltd
Priority to CN202210554179.XA
Publication of CN114974218A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 15/08: Speech classification or search
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/008: Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Abstract

The invention relates to a voice conversion model training method and device, and a voice conversion method and device, in the field of voice conversion. The model training method comprises the following steps: acquiring a first voice and text data with the same content as the first voice, and calculating a first content feature from the text data; extracting spectral features of the first voice, outputting a first spectral feature, and calculating a first speaker feature and a first hidden variable from the first spectral feature; inputting the first hidden variable and the first speaker feature into a flow model and, conditioned on the first speaker feature, calculating and outputting a second speaking feature; calculating a loss function from the second speaking feature and the first content feature; extracting the first hidden variable that reaches a preset optimization parameter; and inputting the optimized first hidden variable into a decoder to obtain predicted speech. The technique of the invention preserves information such as the speaker's intonation well.

Description

Voice conversion model training method and device and voice conversion method and device
Technical Field
The present invention relates to the field of voice conversion, and in particular, to a method and an apparatus for training a voice conversion model, and a method and an apparatus for voice conversion.
Background
Thanks to the development of deep learning and its application in various fields, voice conversion has also benefited greatly. Voice conversion converts the timbre of a voice: the aim is to change only the speaker's timbre while keeping the content, emotion, intonation, speaking rate and so on consistent with the original audio. For example: given two speakers A and B, where A utters a speech segment S, the function of voice conversion is to convert the timbre of S into the voice of B while the remaining content stays unchanged. According to the data set used for training, voice conversion can be divided into: 1. voice conversion based on parallel corpora; 2. voice conversion based on non-parallel corpora. A parallel corpus means that for each sentence S1 in the data set there is another sentence S2 that differs only in the speaker's timbre, while other information such as content, emotion, intonation and speaking rate is the same. Since such parallel corpora are difficult to obtain, current research focuses on voice conversion with non-parallel corpora.
Content encoding: in conventional voice conversion techniques, contrastive predictive coding features are first extracted from the source audio by means of speech recognition. Contrastive predictive coding generally does not carry speaker, timbre or intonation information from the audio; it mostly carries content information.
Speaker encoding: speaker encoding is a technique for extracting speaker information from audio; a speaker vector is generally extracted using deep learning techniques.
Feature decoding: the content encoding and the speaker encoding are fused through a deep learning network, and a loss is computed against the mel spectrum extracted from real speech.
Vocoder: taking the mel spectrum extracted from real speech as input, neural network models such as WaveNet, Parallel WaveNet and HiFi-GAN are used to predict the real speech waveform. At the inference stage, the input is the mel spectrum converted from the source audio rather than the true mel spectrum.
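As an illustration of the feature pipeline described above, the following is a minimal sketch of mel-spectrogram extraction and a mel reconstruction loss, assuming PyTorch and torchaudio; the sampling rate, FFT size, hop length, mel-band count and file name are illustrative assumptions, not values specified in this disclosure.

```python
# Minimal mel-spectrogram feature extraction feeding a neural vocoder.
# All hyperparameters below are illustrative assumptions.
import torch
import torchaudio

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050,   # assumed sampling rate
    n_fft=1024,          # FFT window size
    hop_length=256,      # frame shift
    n_mels=80,           # number of mel bands
)

waveform, sr = torchaudio.load("speech.wav")        # (channels, samples)
mel = mel_transform(waveform)                       # (channels, n_mels, frames)
log_mel = torch.log(torch.clamp(mel, min=1e-5))     # log compression

def mel_reconstruction_loss(pred_wave: torch.Tensor, target_log_mel: torch.Tensor) -> torch.Tensor:
    # Compare the mel spectrum of a predicted waveform with the ground-truth mel spectrum.
    pred_mel = torch.log(torch.clamp(mel_transform(pred_wave), min=1e-5))
    frames = min(pred_mel.size(-1), target_log_mel.size(-1))
    return torch.nn.functional.l1_loss(pred_mel[..., :frames], target_log_mel[..., :frames])
```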
The existing technical route is as follows: 1. content encoding is obtained with a speech recognition framework; 2. speaker vectors are extracted from a pre-trained model. In the training phase, the outputs of 1 and 2 are decoded to obtain the mel spectrum of the source audio. In the inference phase, the speaker vector in 2 is replaced with the speaker vector of the target speaker. The disadvantages are that the content recognition depends on a speech recognition model, and the converted audio retains only the content information; intonation and the like cannot be converted.
Disclosure of Invention
To address the defects of the prior art, the invention provides a framework that does not need to rely on speech recognition to encode the content and that retains information such as intonation.
In order to solve the above technical problem, the invention adopts the following technical solution:
a speech conversion model training method comprises the following steps: acquiring first voice and text data with the same content as the first voice, and calculating a first content characteristic by using the text data;
extracting the spectral feature of the first voice, outputting the first spectral feature, and calculating the first speaker feature and the first hidden variable through the first spectral feature;
and calculating and outputting a second speaking characteristic by taking the first speaker characteristic as a condition on the input stream model of the first implicit variable and the first speaker characteristic, calculating a loss function by taking the second speaking characteristic and the first content characteristic, extracting the first implicit variable reaching a preset optimization parameter, and inputting the optimized first implicit variable into a decoder to obtain predicted speech.
Preferably, the specific method of extracting spectral features of the first voice, outputting the first spectral feature, and calculating the first speaker feature and the first hidden variable from the first spectral feature comprises:
calculating the first hidden variable from the first spectral feature with a posterior encoder, wherein the posterior encoder comprises a plurality of WaveNet residual blocks.
Preferably, the specific method of extracting spectral features of the first voice, outputting the first spectral feature, and calculating the first speaker feature and the first hidden variable from the first spectral feature comprises:
calculating the first speaker feature from the first spectral feature with a speaker encoder, wherein the speaker encoder comprises a Transformer model.
Preferably, the flow model comprises a plurality of WaveNet residual blocks and is used to construct the mapping between content features and hidden variables:
content features are converted into hidden variables through the flow model, and hidden variables are converted into content features through the flow model.
Preferably, the method of calculating the first content feature from the text data comprises:
obtaining phonemes corresponding to the text from the text data through grapheme-to-phoneme conversion, embedding the phonemes of the text, and encoding the embedded features with a CBHG module to obtain the first content feature.
The invention also discloses a voice conversion method, which uses a flow model obtained by training according to the above voice conversion model training method, and further comprises the following steps:
acquiring a first audio feature P1 that is independent of the source-audio speaker information;
acquiring the voice of the target speaker to be converted, extracting spectral features of the target speaker's voice, outputting a second spectral feature, and calculating a second speaking feature S2 from the second spectral feature;
and inputting the second speaking feature and the first audio feature into the flow model to obtain a second hidden variable Z2, and decoding the second hidden variable to generate the target audio.
The method of obtaining the first audio feature independent of the source-audio speaker information comprises encoding with a content encoder.
Preferably, the speaker information comprises the speaker's timbre.
Preferably, the method of obtaining the first audio feature independent of the source-audio speaker information comprises:
inputting the first hidden variable and the second speaker feature into the flow model and, conditioned on the second speaker feature, calculating and outputting the second speaking feature P1.
The invention also provides a speech conversion model training device, comprising: a main controller, used for acquiring the first voice, calculating the first spectral feature, acquiring text data with the same content as the first voice, and controlling the input and output of data among the content encoder, the posterior encoder, the flow model unit and the decoder;
the content encoder, used for acquiring text data with the same content as the first voice and calculating the first content feature from the text data;
the posterior encoder, which receives the first spectral feature and calculates the first hidden variable from the first spectral feature;
the speaker encoder, which receives the first spectral feature and calculates the first speaker feature from the first spectral feature;
the flow model unit, used for receiving the first hidden variable and the first speaker feature, calculating and outputting the second speaking feature conditioned on the first speaker feature, calculating a loss function from the second speaking feature and the first content feature, and extracting the first hidden variable that reaches a preset optimization parameter;
and the decoder, into which the optimized first hidden variable is input to obtain the predicted speech.
The present invention also provides a voice conversion apparatus, comprising:
a speech conversion model training apparatus comprising:
the content encoder, used for encoding the text content of the source speaker's voice through a deep learning model to obtain the first audio feature independent of the source speaker information;
the speaker encoder, used for receiving the target speaker's voice to be converted, extracting spectral features of the target speaker's voice, outputting the second spectral feature, and calculating the second speaking feature from the second spectral feature;
the flow model unit, which receives the second speaking feature and the first audio feature and outputs the second hidden variable conditioned on the second speaking feature;
and the decoder, which receives the second hidden variable and outputs the target audio.
The invention has the beneficial effects that:
the invention avoids the defects of the prior art, does not need to rely on a frame of speech recognition to code the content, and can also keep information such as tone and intonation. Because the structure of the invention can well decouple the characteristics of the speaker and the non-speaker in the frequency spectrum, the converted audio frequency reserves other information except the tone color of the speaker.
In addition, the invention is a universal voice conversion technology, which can well convert the source audio of any language. The corresponding relation of the frame level is more precise than the corresponding relation of the phoneme level by extracting the frequency spectrum characteristics, so that even if the training set only comprises a Chinese data set, the English, Japanese, Korean, Western, dialect and the like can be well converted.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flowchart of a speech conversion model training method according to embodiment 1;
fig. 2 is a flowchart of a speech conversion method of embodiment 2.
Detailed Description
The present invention will be described in further detail with reference to examples, which are illustrative of the present invention and are not to be construed as limiting it.
Example 1:
A speech conversion model training method is provided, comprising the following steps:
acquiring a first voice and text data with the same content as the first voice, and calculating a first content feature C from the text data;
extracting spectral features of the first voice, outputting a first spectral feature, and calculating a first speaker feature S1 and a first hidden variable Z1 from the first spectral feature;
inputting the first hidden variable Z1 and the first speaker feature S1 into a flow model and, conditioned on the first speaker feature, calculating and outputting a second speaking feature P1; calculating a loss function from the second speaking feature and the first content feature; extracting the first hidden variable that reaches a preset optimization parameter; and inputting the optimized first hidden variable into a decoder to obtain the predicted speech.
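The following is a condensed, non-authoritative sketch of one training step that wires these components together, assuming PyTorch and assuming the content encoder, posterior encoder, speaker encoder, flow model and decoder are available as modules. The module names, the use of L1 losses and the omission of shape and length alignment are assumptions made for illustration; the patent does not specify the exact loss form.

```python
# One training step of the model described above (hedged sketch).
import torch
import torch.nn.functional as F

def training_step(batch, content_enc, posterior_enc, speaker_enc, flow, decoder, optimizer):
    text, first_spec, first_wave = batch   # phoneme ids, first spectral feature, waveform

    c = content_enc(text)                  # first content feature C
    s1 = speaker_enc(first_spec)           # first speaker feature S1
    z1 = posterior_enc(first_spec)         # first hidden variable Z1

    # Conditioned on the speaker feature, the flow model maps Z1 to the second
    # speaking feature P1, which is compared with the content feature C
    # (L1 loss here, assuming C and P1 were aligned to the same shape).
    p1 = flow(z1, cond=s1)
    content_loss = F.l1_loss(p1, c)

    # The hidden variable is decoded into predicted speech and compared with the real waveform.
    pred_wave = decoder(z1).squeeze(1)
    recon_loss = F.l1_loss(pred_wave, first_wave)

    loss = content_loss + recon_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```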
The specific method of extracting spectral features of the first voice, outputting the first spectral feature, and calculating the first speaker feature S1 and the first hidden variable Z1 from the first spectral feature comprises the following step:
calculating the first hidden variable Z1 from the first spectral feature with a posterior encoder, wherein the posterior encoder comprises a plurality of WaveNet residual blocks.
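A minimal sketch of such a posterior encoder built from WaveNet-style residual blocks (dilated 1-D convolution with a gated activation and a residual connection) follows, assuming PyTorch; the channel sizes, dilation pattern and the Gaussian re-parameterisation of the hidden variable are illustrative assumptions rather than the patent's exact configuration.

```python
import torch
import torch.nn as nn

class WaveNetResidualBlock(nn.Module):
    def __init__(self, channels: int, dilation: int, kernel_size: int = 3):
        super().__init__()
        pad = (kernel_size - 1) // 2 * dilation
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size, padding=pad, dilation=dilation)
        self.res = nn.Conv1d(channels, channels, 1)

    def forward(self, x):                           # x: (batch, channels, frames)
        a, b = self.conv(x).chunk(2, dim=1)
        gated = torch.tanh(a) * torch.sigmoid(b)    # gated activation unit
        return x + self.res(gated)                  # residual connection

class PosteriorEncoder(nn.Module):
    """Maps spectral features to a hidden variable Z (sampled from mean / log-variance)."""
    def __init__(self, spec_dim: int = 513, hidden: int = 192, n_blocks: int = 8):
        super().__init__()
        self.pre = nn.Conv1d(spec_dim, hidden, 1)
        self.blocks = nn.ModuleList(
            [WaveNetResidualBlock(hidden, dilation=2 ** (i % 4)) for i in range(n_blocks)])
        self.proj = nn.Conv1d(hidden, 2 * hidden, 1)

    def forward(self, spec):                         # spec: (batch, spec_dim, frames)
        h = self.pre(spec)
        for block in self.blocks:
            h = block(h)
        mean, logvar = self.proj(h).chunk(2, dim=1)
        z = mean + torch.randn_like(mean) * torch.exp(0.5 * logvar)   # sample Z1
        return z
```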
The specific method of extracting spectral features of the first voice, outputting the first spectral feature, and calculating the first speaker feature S1 and the first hidden variable Z1 from the first spectral feature comprises the following step:
calculating the first speaker feature S1 from the first spectral feature with a speaker encoder, wherein the speaker encoder comprises a Transformer model.
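A minimal sketch of a speaker encoder of this kind is shown below: a Transformer encoder over spectral frames, mean-pooled to a fixed-length speaker feature. PyTorch is assumed, and the layer counts and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    def __init__(self, spec_dim: int = 513, d_model: int = 192,
                 n_layers: int = 4, n_heads: int = 2, out_dim: int = 256):
        super().__init__()
        self.proj_in = nn.Linear(spec_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.proj_out = nn.Linear(d_model, out_dim)

    def forward(self, spec):                     # spec: (batch, frames, spec_dim)
        h = self.encoder(self.proj_in(spec))     # (batch, frames, d_model)
        return self.proj_out(h.mean(dim=1))      # (batch, out_dim) speaker feature S1
```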
The flow model comprises a plurality of WaveNet residual blocks and is used to construct the mapping between content features and hidden variables: content features are converted into hidden variables through the flow model, and hidden variables are converted into content features through the flow model.
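One way to realize such an invertible mapping is a stack of affine coupling layers conditioned on the speaker feature, as in the hedged sketch below (PyTorch assumed). The coupling design, plain convolutions, channel counts and layer count are assumptions for illustration; the patent describes the blocks only as WaveNet residual blocks. The forward direction is used during training (hidden variable to content-like feature) and the inverse direction during conversion.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, channels: int, speaker_dim: int, hidden: int = 192):
        super().__init__()
        half = channels // 2
        self.net = nn.Sequential(
            nn.Conv1d(half + speaker_dim, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, 2 * half, 3, padding=1))

    def _scale_shift(self, xa, cond):
        cond = cond.unsqueeze(-1).expand(-1, -1, xa.size(-1))   # broadcast speaker feature over frames
        log_s, t = self.net(torch.cat([xa, cond], dim=1)).chunk(2, dim=1)
        return torch.tanh(log_s), t

    def forward(self, x, cond):
        xa, xb = x.chunk(2, dim=1)
        log_s, t = self._scale_shift(xa, cond)
        return torch.cat([xa, xb * torch.exp(log_s) + t], dim=1)

    def inverse(self, y, cond):
        ya, yb = y.chunk(2, dim=1)
        log_s, t = self._scale_shift(ya, cond)
        return torch.cat([ya, (yb - t) * torch.exp(-log_s)], dim=1)

class FlowModel(nn.Module):
    def __init__(self, channels: int = 192, speaker_dim: int = 256, n_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            [AffineCoupling(channels, speaker_dim) for _ in range(n_layers)])

    def forward(self, z, cond):                  # hidden variable -> content-like feature
        for layer in self.layers:
            z = layer(z, cond).flip(dims=[1])    # flip channels so both halves get transformed
        return z

    def inverse(self, c, cond):                  # content feature -> hidden variable
        for layer in reversed(self.layers):
            c = layer.inverse(c.flip(dims=[1]), cond)
        return c
```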
As a preferable solution, the method of calculating the first content feature C from the text data comprises:
obtaining phonemes corresponding to the text from the text data through grapheme-to-phoneme conversion, embedding the phonemes of the text, and encoding the embedded features with a CBHG module to obtain the first content feature.
The first step: pinyin is obtained from the text with a grapheme-to-phoneme tool, and the initials and finals of the pinyin are then split to obtain the phonemes corresponding to the text.
The second step: all phonemes form a phoneme dictionary, the size of the phoneme dictionary is used as the dimension of the embedding layer, and the phonemes of the text are embedded.
The third step: the embedded features are encoded by a CBHG module, which comprises a one-dimensional convolutional filter bank, a highway network, and a recurrent neural network of bidirectional gated recurrent units.
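A compact sketch of this content-encoder path (grapheme-to-phoneme conversion, phoneme embedding, and a simplified CBHG-style encoder with a convolution bank, highway layers and a bidirectional GRU) follows. The pypinyin dependency, the dictionary handling and all layer dimensions are assumptions, not components named in this disclosure, and the initial/final split is abbreviated to whole pinyin syllables.

```python
import torch
import torch.nn as nn
from pypinyin import lazy_pinyin          # assumed grapheme-to-phoneme tool

def text_to_phonemes(text: str) -> list[str]:
    # Pinyin syllables stand in for the initial/final split described above.
    return lazy_pinyin(text)

class Highway(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.h = nn.Linear(dim, dim)
        self.t = nn.Linear(dim, dim)

    def forward(self, x):
        gate = torch.sigmoid(self.t(x))
        return gate * torch.relu(self.h(x)) + (1.0 - gate) * x

class CBHGContentEncoder(nn.Module):
    def __init__(self, n_phonemes: int, dim: int = 192, bank_size: int = 8):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, dim)    # dictionary size sets the embedding table
        self.conv_bank = nn.ModuleList(
            [nn.Conv1d(dim, dim, k, padding=k // 2) for k in range(1, bank_size + 1)])
        self.proj = nn.Conv1d(dim * bank_size, dim, 3, padding=1)
        self.highway = nn.Sequential(Highway(dim), Highway(dim))
        self.gru = nn.GRU(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, phoneme_ids):                   # (batch, time)
        x = self.embed(phoneme_ids)                   # (batch, time, dim)
        h = x.transpose(1, 2)                         # (batch, dim, time)
        bank = [conv(h)[..., : h.size(-1)] for conv in self.conv_bank]   # trim even-kernel outputs
        h = self.proj(torch.cat(bank, dim=1)).transpose(1, 2)
        h = self.highway(h + x)                       # residual into highway layers
        out, _ = self.gru(h)                          # (batch, time, dim) content feature C
        return out
```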
Preferably, the decoder has the same structure as the HiFi-GAN generator.
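Under that assumption, a minimal HiFi-GAN-style generator used as the decoder might look like the sketch below: transposed-convolution upsampling followed by residual convolution stacks that turn the hidden-variable sequence into a waveform. The upsampling rates and channel widths are illustrative, and the multi-receptive-field fusion of the full HiFi-GAN design is omitted.

```python
import torch
import torch.nn as nn

class ResStack(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(channels, channels, 3, padding=d, dilation=d) for d in (1, 3, 5)])

    def forward(self, x):
        for conv in self.convs:
            x = x + conv(torch.nn.functional.leaky_relu(x, 0.1))
        return x

class Decoder(nn.Module):
    def __init__(self, in_dim: int = 192, base: int = 256, up_rates=(8, 8, 2, 2)):
        super().__init__()
        self.pre = nn.Conv1d(in_dim, base, 7, padding=3)
        ups, res = [], []
        ch = base
        for r in up_rates:
            ups.append(nn.ConvTranspose1d(ch, ch // 2, 2 * r, stride=r, padding=r // 2))
            ch //= 2
            res.append(ResStack(ch))
        self.ups = nn.ModuleList(ups)
        self.res = nn.ModuleList(res)
        self.post = nn.Conv1d(ch, 1, 7, padding=3)

    def forward(self, z):                        # z: (batch, in_dim, frames)
        h = self.pre(z)
        for up, stack in zip(self.ups, self.res):
            h = stack(up(torch.nn.functional.leaky_relu(h, 0.1)))
        return torch.tanh(self.post(h))          # (batch, 1, samples)
```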
Example 2:
a speech conversion method includes a stream model trained according to the speech conversion model training method disclosed in embodiment 1, wherein a first speech is a source speech, and a second speech is a target speech, i.e., first speaker information with the first speech removed is converted into a target speech with target speaker speech information of the second speech.
The method comprises the following steps:
acquiring a first audio feature P1 that is independent of the source-audio speaker information;
acquiring the voice of the target speaker to be converted, extracting spectral features of the target speaker's voice, outputting a second spectral feature, and calculating a second speaking feature S2 from the second spectral feature;
and inputting the second speaking feature and the first audio feature into the flow model to obtain a second hidden variable Z2, and decoding the second hidden variable to generate the target audio.
The method of obtaining the first audio feature independent of the source-audio speaker information comprises encoding with a content encoder.
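Putting the pieces together, a hedged sketch of this inference path is given below, reusing the hypothetical module names from the earlier sketches; shapes, transposes and call signatures are assumptions made for illustration.

```python
import torch

@torch.no_grad()
def convert(source_text_phonemes, target_spec, content_enc, speaker_enc, flow, decoder):
    # First audio feature P1: content information independent of the source speaker.
    p1 = content_enc(source_text_phonemes)              # (batch, frames, dim)

    # Second speaking feature S2 from the target speaker's spectral features.
    s2 = speaker_enc(target_spec)                       # (batch, speaker_dim)

    # Conditioned on S2, the inverse flow maps P1 to the second hidden variable Z2.
    z2 = flow.inverse(p1.transpose(1, 2), cond=s2)      # (batch, dim, frames)

    # The decoder turns Z2 into the target audio waveform.
    return decoder(z2)                                  # (batch, 1, samples)
```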
Example 3
A speech conversion model training apparatus comprising:
the main controller is used for acquiring the first voice, calculating the first spectral feature, acquiring text data with the same content as the first voice, and controlling the input and output of data among the content encoder, the posterior encoder, the flow model unit and the decoder;
the content encoder is used for acquiring the text data with the same content as the first voice and calculating the first content feature from the text data;
the text content of the source speaker's voice is encoded through a deep learning model to obtain content encoding information independent of the source speaker information;
the posterior encoder receives the first spectral feature and calculates the first hidden variable from the first spectral feature;
the speaker encoder receives the first spectral feature and calculates the first speaker feature from the first spectral feature;
the flow model unit is used for receiving the first hidden variable and the first speaker feature, calculating and outputting the second speaking feature conditioned on the first speaker feature, calculating a loss function from the second speaking feature and the first content feature, and extracting the first hidden variable that reaches a preset optimization parameter;
and the decoder, into which the optimized first hidden variable is input to obtain the predicted speech.
The trained flow model unit realizes the conversion between content features and hidden variables; that is, a content feature can be input to the flow model unit, and the flow model unit outputs a hidden variable.
Example 4
A speech conversion model training apparatus comprising:
a content encoder, used for encoding the text content of the source speaker's voice through a deep learning model to obtain content encoding information independent of the source speaker information;
a target encoder, used for extracting a target speaker feature vector from the target speaker's voice S2;
a flow model unit, which in this embodiment is a trained flow model unit that realizes the conversion between content features and hidden variables;
and a decoder, which takes the hidden variable as input and outputs the audio of the target speaker.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the modules or units is only one logical functional division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another device, or some features may be omitted, or not executed.
The units may or may not be physically separate, and components displayed as units may be one physical unit or a plurality of physical units, that is, may be located in one place, or may be distributed in a plurality of different places. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a readable storage medium. Based on such understanding, the technical solution of the embodiments of the present invention may be essentially or partially contributed to by the prior art, or all or part of the technical solution may be embodied in the form of a software product, where the software product is stored in a storage medium and includes several instructions to enable a device (which may be a single chip, a chip, or the like) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only an embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (10)

1. A speech conversion model training method, characterized by comprising the steps of:
acquiring a first voice and text data with the same content as the first voice, and calculating a first content feature from the text data;
extracting spectral features of the first voice, outputting a first spectral feature, and calculating a first speaker feature and a first hidden variable from the first spectral feature;
inputting the first hidden variable and the first speaker feature into a flow model and, conditioned on the first speaker feature, calculating and outputting a second speaking feature; calculating a loss function from the second speaking feature and the first content feature; extracting the first hidden variable that reaches a preset optimization parameter; and inputting the optimized first hidden variable into a decoder to obtain predicted speech.
2. The speech conversion model training method according to claim 1, wherein extracting spectral features of the first voice, outputting the first spectral feature, and calculating the first speaker feature and the first hidden variable from the first spectral feature specifically comprises:
calculating the first hidden variable from the first spectral feature with a posterior encoder, wherein the posterior encoder comprises a plurality of WaveNet residual blocks.
3. The speech conversion model training method according to claim 1, wherein extracting spectral features of the first voice, outputting the first spectral feature, and calculating the first speaker feature and the first hidden variable from the first spectral feature specifically comprises:
calculating the first speaker feature from the first spectral feature with a speaker encoder, wherein the speaker encoder comprises a Transformer model.
4. The speech conversion model training method according to claim 1, wherein the flow model comprises a plurality of WaveNet residual blocks and is used to construct a mapping between content features and hidden variables,
content features being converted into hidden variables through the flow model, and hidden variables being converted into content features through the flow model.
5. The speech conversion model training method according to claim 1, wherein calculating the first content feature from the text data comprises:
obtaining phonemes corresponding to the text from the text data through grapheme-to-phoneme conversion, embedding the phonemes of the text, and encoding the embedded features with a CBHG module to obtain the first content feature.
6. A speech conversion method, comprising a flow model trained by the speech conversion model training method according to any one of claims 1 to 5, and further comprising the steps of:
acquiring a first audio feature P1 that is independent of the source-audio speaker information;
acquiring the voice of the target speaker to be converted, extracting spectral features of the target speaker's voice, outputting a second spectral feature, and calculating a second speaking feature S2 from the second spectral feature;
inputting the second speaking feature and the first audio feature into the flow model to obtain a second hidden variable Z2, and decoding the second hidden variable to generate the target audio;
wherein the method of obtaining the first audio feature independent of the source-audio speaker information comprises encoding with a content encoder.
7. The speech conversion method according to claim 6, wherein the speaker information comprises the speaker's timbre.
8. The speech conversion method according to claim 6, wherein the method of obtaining the first audio feature independent of the source-audio speaker information comprises:
inputting the first hidden variable and the second speaker feature into the flow model and, conditioned on the second speaker feature, calculating and outputting the second speaking feature P1.
9. A speech conversion model training apparatus, characterized by comprising: a main controller, used for acquiring a first voice, calculating a first spectral feature, acquiring text data with the same content as the first voice, and controlling the input and output of data among a content encoder, a posterior encoder, a flow model unit and a decoder;
the content encoder, used for acquiring the text data with the same content as the first voice and calculating a first content feature from the text data;
the posterior encoder, which receives the first spectral feature and calculates a first hidden variable from the first spectral feature;
a speaker encoder, which receives the first spectral feature and calculates a first speaker feature from the first spectral feature;
the flow model unit, used for receiving the first hidden variable and the first speaker feature, calculating and outputting a second speaking feature conditioned on the first speaker feature, calculating a loss function from the second speaking feature and the first content feature, and extracting the first hidden variable that reaches a preset optimization parameter;
and the decoder, into which the optimized first hidden variable is input to obtain predicted speech.
10. A speech conversion apparatus, characterized by comprising:
a speech conversion model training apparatus comprising:
a content encoder, used for encoding the text content of the source speaker's voice through a deep learning model to obtain a first audio feature independent of the source speaker information;
a speaker encoder, used for receiving the target speaker's voice to be converted, extracting spectral features of the target speaker's voice, outputting a second spectral feature, and calculating a second speaking feature from the second spectral feature;
a flow model unit, which receives the second speaking feature and the first audio feature and, conditioned on the second speaking feature, outputs a second hidden variable;
and a decoder, which receives the second hidden variable and outputs the target audio.
Application CN202210554179.XA, filed 2022-05-20 (priority date 2022-05-20): Voice conversion model training method and device and voice conversion method and device. Status: Pending. Published as CN114974218A (en).

Priority Applications (1)

Application Number: CN202210554179.XA; Priority Date: 2022-05-20; Filing Date: 2022-05-20; Title: Voice conversion model training method and device and voice conversion method and device

Applications Claiming Priority (1)

Application Number: CN202210554179.XA; Priority Date: 2022-05-20; Filing Date: 2022-05-20; Title: Voice conversion model training method and device and voice conversion method and device

Publications (1)

Publication Number: CN114974218A; Publication Date: 2022-08-30

Family

ID=82984912

Family Applications (1)

Application Number: CN202210554179.XA; Title: Voice conversion model training method and device and voice conversion method and device; Status: Pending; Publication: CN114974218A (en)

Country Status (1)

Country Link
CN (1) CN114974218A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115631275A (en) * 2022-11-18 2023-01-20 北京红棉小冰科技有限公司 Multi-mode driven human body action sequence generation method and device
CN115631275B (en) * 2022-11-18 2023-03-31 北京红棉小冰科技有限公司 Multi-mode driven human body action sequence generation method and device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination